Posted to user@arrow.apache.org by Daniel Nugent <nu...@gmail.com> on 2020/03/30 13:31:05 UTC

Attn: Wes, Re: Masked Arrays

Didn’t want to follow up on this on the Jira issue earlier since it's sort of tangential to that bug and more of a usage question. You said:

> I wouldn't recommend building applications based on them nowadays since the level of support / compatibility in other projects is low.

In my case, I am using them since they seemed like a straightforward representation of my data that has nulls: the format I’m converting from has zero-cost numpy representations, and converting from an internal format into Arrow in-memory structures appears zero-cost (or close to it) as well. I guess I could just provide the mask as an explicit argument, but my original desire to use them came from being able to exploit numpy.ma.concatenate in a way that saved some complexity in implementation.

Since Arrow itself supports masking values with a bitfield, is there something intrinsic to the notion of array masks that is not well supported? Or do you just mean the specific numpy MaskedArray class?

If this is too much of a numpy question rather than an Arrow question, could you point me to where I can read up on masked array support, or maybe to the right place to ask the numpy community whether what I'm doing is appropriate?

Thanks,


-Dan Nugent

Re: Attn: Wes, Re: Masked Arrays

Posted by Micah Kornfield <em...@gmail.com>.
I think the reason for not requiring a specific value is that it allows
mutating only the nullability of an element without touching the value,
which in some contexts can be useful.

As an implementation-specific detail, this might be something that can be
added to metadata and propagated accordingly within the C++
implementation/compute kernels.
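That property can be sketched with plain numpy standing in for Arrow's separate value and validity buffers (Arrow's validity bitmap is LSB-first):

```python
import numpy as np

# Arrow keeps values and validity in separate buffers; a null is just a
# cleared bit in the validity bitmap (bit i % 8 of byte i // 8).
values = np.array([10, 20, 30, 40], dtype=np.int64)
validity = bytearray([0b00001111])  # all four slots valid

def set_null(validity, i):
    # Clear only the validity bit; the value buffer is untouched.
    validity[i // 8] &= ~(1 << (i % 8))

set_null(validity, 2)
is_valid = [(validity[i // 8] >> (i % 8)) & 1 == 1 for i in range(4)]
assert is_valid == [True, True, False, True]
assert values[2] == 30  # the value "underneath" the null is still there
```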


On Tue, Apr 7, 2020 at 3:04 AM Felix Benning <fe...@gmail.com>
wrote:

> I guess it would be helpful, when trying to achieve zero-modification
> between R and another language, if the standard used for communication
> allowed for that, or when setting all nulls to zero for an algorithm
> and then saving the result to a database for later use. But I have only
> known about this project for a couple of days, so my opinion on this is
> likely uneducated at best. I am mostly curious about how this pans out
> and what the trade-offs are; since I went down this rabbit hole of "how
> to handle nulls" this far, I might as well bottom out ;-).
>
> - Felix
>
> On 06.04.20 22:26, Wes McKinney wrote:
> > For the sake of others reading, this discussion might be a bit
> > confusing to happen upon because the scope isn't clear. It seems that
> > we are discussing the C++ implementation and not the columnar format,
> > is that right?
> >
> > Adding any additional metadata about this to the columnar format /
> > Flatbuffers files / C interface is probably a non-starter. We've
> > discussed the contents of data "underneath" a null and consistently
> > the consensus is that it is unspecified.
> >
> > Applications (as well as internal details of some implementations and
> > their interactions with external libraries) are free to set
> > custom_metadata fields in schemas to indicate otherwise. However, one
> > must take care to not propagate this metadata inappropriately from one
> > realization of a schema (as an Array or RecordBatch) where it is true
> > to another where it is not true. Similarly, one should also be careful
> > not to use such metadata on data whose provenance is unknown.
> >
> > - Wes
> >
> > On Mon, Apr 6, 2020 at 11:37 AM Felix Benning <fe...@gmail.com>
> wrote:
> >> In that case it is probably necessary to have a "has_sentinel" flag and
> a
> >> "sentinel_value" variable. Since other algorithms might benefit from not
> >> having to set these values to zero. Which is probably the reason why the
> >> value "underneath" was set to unspecified in the first place.
> Alternatively
> >> a "sentinel_enum" could specify whether the sentinel is 0, or the R
> >> sentinel value is used. This would sacrifice flexibility for size.
> Although
> >> size probably does not matter, when meta data for entire columns are
> >> concerned. So the first approach is probably better.
> >>
> >> Felix
> >>
> >> On Mon, 6 Apr 2020 at 17:59, Francois Saint-Jacques <
> fsaintjacques@gmail.com>
> >> wrote:
> >>
> >>> It does make sense, I would go a little further and make this
> >>> field/property a single value of the same type as the array. This
> >>> would allow using any arbitrary sentinel value for unknown values (0
> >>> in your suggested case). The end result is zero-copy for R bindings
> >>> (if stars are aligned). I created ARROW-8348 [1] for this.
> >>>
> >>> François
> >>>
> >>> [1] https://jira.apache.org/jira/browse/ARROW-8348
> >>>
> >>> On Mon, Apr 6, 2020 at 11:02 AM Felix Benning <felix.benning@gmail.com
> >
> >>> wrote:
> >>>> Would it make sense to have an `na_are_zero` flag? Since null
> checking is
> >>>> not without cost, it might be helpful to some algorithms, if the
> content
> >>>> "underneath" the nulls is zero. For example in means, or scalar
> products
> >>>> and thus matrix multiplication, knowing that the array has zeros where
> >>> the
> >>>> na's are, would allow these algorithms to pretend that there are no
> na's.
> >>>> Since setting all nulls to zero in a matrix of n columns and n rows
> costs
> >>>> O(n^2), it would make sense to set them all to zero before matrix
> >>>> multiplication i.e. O(n^3) and similarly expensive algorithms. If
> there
> >>> was
> >>>> a `na_are_zero` flag, other algorithms could later utilize this work
> >>>> already being done. Algorithms which change the data and violate this
> >>>> contract, would only need to reset the flag. And in some use cases, it
> >>>> might be possible to use idle time of the computer to "clean up" the
> >>> na's,
> >>>> preparing for the next query.
> >>>>
> >>>> Felix
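The hypothetical `na_are_zero` idea above can be sketched in plain numpy (the flag and names are illustrative, not an existing Arrow API):

```python
import numpy as np

# Data where the slots "underneath" the nulls are guaranteed zero,
# plus a null count carried as metadata.
values = np.array([1.0, 0.0, 3.0, 0.0, 5.0])
null_mask = np.array([False, True, False, True, False])
null_count = int(null_mask.sum())

# The sum then needs no per-element null branching at all; the mean
# only has to divide by the number of valid slots.
total = values.sum()                       # nulls contribute 0
mean = total / (len(values) - null_count)
assert mean == 3.0
```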
> >>>>
> >>>> ---------- Forwarded message ---------
> >>>> From: Wes McKinney <we...@gmail.com>
> >>>> Date: Sun, 5 Apr 2020 at 22:31
> >>>> Subject: Re: Attn: Wes, Re: Masked Arrays
> >>>> To: <us...@arrow.apache.org>
> >>>>
> >>>>
> >>>> As I recall the contents "underneath" have been discussed before and
> >>>> the consensus was that the contents are not specified. If you'd like
> >>>> to make a proposal to change something I would suggest raising it on
> >>>> dev@arrow.apache.org
> >>>>
> >>>> On Sun, Apr 5, 2020 at 1:56 PM Felix Benning <felix.benning@gmail.com
> >
> >>>> wrote:
> >>>>> Follow up: Do you think it would make sense to have an `na_are_zero`
> >>>> flag? Since it appears that the baseline (naively assuming there are
> no
> >>>> null values) is still a bit faster than equally optimized null value
> >>>> handling algorithms. So you might want to make the assumption, that
> all
> >>>> null values are set to zero in the array (instead of undefined). This
> >>> would
> >>>> allow for very fast means, scalar products and thus matrix
> multiplication
> >>>> which ignore nas. And in case of matrix multiplication, you might
> prefer
> >>>> sacrificing an O(n^2) effort to set all null entries to zero before
> >>>> multiplying. And assuming you do not overwrite this data, you would be
> >>> able
> >>>> to reuse that assumption in later computations with such a flag.
> >>>>> In some use cases, you might even be able to utilize unused computing
> >>>> resources for this task. I.e. clean up the nulls while the computer is
> >>> not
> >>>> used, preparing for the next query.
> >>>>>
> >>>>> On Sun, 5 Apr 2020 at 18:34, Felix Benning <fe...@gmail.com>
> >>>> wrote:
> >>>>>> Awesome, that was exactly what I was looking for, thank you!
> >>>>>>
> >>>>>> On Sun, 5 Apr 2020 at 00:40, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>>>>> I wrote a blog post a couple of years ago about this
> >>>>>>>
> >>>>>>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >>>>>>>
> >>>>>>> Pasha Stetsenko did a follow-up analysis that showed that my
> >>>>>>> "sentinel" code could be significantly improved, see:
> >>>>>>>
> >>>>>>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
> >>>>>>>
> >>>>>>> Generally speaking in Apache Arrow we've been happy to have a
> uniform
> >>>>>>> representation of nullness across all types, both primitive
> >>> (booleans,
> >>>>>>> numbers, or strings) and nested (lists, structs, unions, etc.).
> Many
> >>>>>>> computational operations (like elementwise functions) need not
> >>> concern
> >>>>>>> themselves with the nulls at all, for example, since the bitmap
> from
> >>>>>>> the input array can be passed along (with zero copy even) to the
> >>>>>>> output array.
> >>>>>>>
> >>>>>>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <
> >>> felix.benning@gmail.com>
> >>>> wrote:
> >>>>>>>> Does anyone have an opinion (or links) about Bitpattern vs Masked
> >>>> Arrays for NA implementations? There seems to have been a discussion
> >>> about
> >>>> that in the numpy community in 2012
> >>>> https://numpy.org/neps/nep-0026-missing-data-summary.html without an
> >>>> apparent result.
> >>>>>>>> Summary of the Summary:
> >>>>>>>> - The Bitpattern approach reserves one bitpattern of any type as
> >>> na,
> >>>> the only type not having spare bitpatterns are integers which means
> this
> >>>> decreases their range by one. This approach is taken by R and was
> >>> regarded
> >>>> as more performant in 2012.
> >>>>>>>> - The Mask approach was deemed more flexible, since it would allow
> >>>> "degrees of missingness", and also cleaner/easier implementation.
> >>>>>>>> Since bitpattern checks would probably disrupt SIMD, I feel like
> >>> some
> >>>> calculations (e.g. mean) would actually benefit more, from setting na
> >>>> values to zero, proceeding as if they were not there, and using the
> >>> number
> >>>> of nas in the metadata to adjust the result. This of course does not
> work
> >>>> if two columns are used (e.g. scalar product), which is probably more
> >>>> important.
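The two approaches can be put side by side in a small numpy sketch (R reserves INT_MIN as its integer NA; used here purely for illustration):

```python
import numpy as np

INT_NA = np.iinfo(np.int32).min  # the bitpattern R reserves as integer NA

values = np.array([1, INT_NA, 3], dtype=np.int32)

# Bitpattern approach: nulls are found by comparing against the sentinel,
# at the cost of one value from the integer range.
na_from_sentinel = values == INT_NA

# Bitmask approach: validity is carried out-of-band, next to the values.
validity = np.array([True, False, True])
na_from_bitmap = ~validity

assert na_from_sentinel.tolist() == [False, True, False]
assert na_from_bitmap.tolist() == [False, True, False]
```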
> >>>>>>>> Was using Bitmasks in Arrow a conscious performance decision? Or
> >>> was
> >>>> the decision only based on the fact, that R and Bitpattern
> >>> implementations
> >>>> in general are a niche, which means that Bitmasks are more compatible
> >>> with
> >>>> other languages?
> >>>>>>>> I am curious about this topic, since the "lack of proper na
> >>> support"
> >>>> was cited as the reason, why Python would never replace R in
> statistics.
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Felix
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 31.03.20 14:52, Joris Van den Bossche wrote:
> >>>>>>>>
> >>>>>>>> Note that pandas is starting to use a notion of "masked arrays" as
> >>>> well, for example for its nullable integer data type, but also not
> using
> >>>> the np.ma masked array, but a custom implementation (for technical
> >>> reasons
> >>>> in pandas this was easier).
> >>>>>>>> Also, there has been quite some discussion last year in numpy
> >>> about a
> >>>> possible re-implementation of a MaskedArray, but using numpy's
> protocols
> >>>> (`__array_ufunc__`, `__array_function__` etc), instead of being a
> >>> subclass
> >>>> like np.ma now is. See eg
> >>>>
> https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html
> >>> .
> >>>>>>>> Joris
> >>>>>>>>
> >>>>>>>> On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <nu...@gmail.com>
> >>> wrote:
> >>>>>>>>> Ok. That actually aligns closely to what I'm familiar with. Good
> >>> to
> >>>> know.
> >>>>>>>>> Thanks again for taking the time to respond,
> >>>>>>>>>
> >>>>>>>>> -Dan Nugent
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <
> >>> wesmckinn@gmail.com>
> >>>> wrote:
> >>>>>>>>>> Social and technical reasons I guess. Empirically it's just not
> >>>> used much.
> >>>>>>>>>> You can see my comments about numpy.ma in my 2010 paper about
> >>> pandas
> >>>>>>>>>>
> >>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
> >>>>>>>>>> At least in 2010, there were notable performance problems when
> >>> using
> >>>>>>>>>> MaskedArray for computations
> >>>>>>>>>>
> >>>>>>>>>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
> >>>>>>>>>> performance reasons (which are beyond the scope of this paper),
> >>> as
> >>>> NaN
> >>>>>>>>>> propagates in floating-point operations in a natural way and can
> >>> be
> >>>>>>>>>> easily detected in algorithms."
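That NaN behavior is easy to demonstrate with numpy:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# NaN propagates through float arithmetic, so missingness survives
# pipelines without any side-band mask object...
assert np.isnan((x * 2).sum())

# ...while NaN-aware reductions and cheap vectorized detection cover
# the cases where nulls should be skipped.
assert np.nansum(x * 2) == 8.0
assert int(np.isnan(x).sum()) == 1
```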
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <
> nugend@gmail.com
> >>>> wrote:
> >>>>>>>>>>> Thanks! Since I'm just using it to jump to Arrow, I think I'll
> >>>> stick with it.
> >>>>>>>>>>> Do you have any feelings about why Numpy's masked arrays didn't
> >>>> gain favor when many data representation formats explicitly support
> >>> nullity
> >>>> (including Arrow)? Is it just that not carrying nulls in computations
> >>>> forward is preferable (that is, early filtering/value filling was
> >>> easier)?
> >>>>>>>>>>> -Dan Nugent
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <
> >>> wesmckinn@gmail.com>
> >>>> wrote:
> >>>>>>>>>>>> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <
> >>> nugend@gmail.com>
> >>>> wrote:
> >>>>>>>>>>>>> Didn’t want to follow up on this on the Jira issue earlier
> >>>> since it's sort of tangential to that bug and more of a usage
> question.
> >>> You
> >>>> said:
> >>>>>>>>>>>>>> I wouldn't recommend building applications based on them
> >>>> nowadays since the level of support / compatibility in other projects
> is
> >>>> low.
> >>>>>>>>>>>>> In my case, I am using them since it seemed like a
> >>>> straightforward representation of my data that has nulls, the format
> I’m
> >>>> converting from has zero cost numpy representations, and converting
> from
> >>> an
> >>>> internal format into Arrow in memory structures appears zero cost (or
> >>> close
> >>>> to it) as well. I guess I can just provide the mask as an explicit
> >>>> argument, but my original desire to use it came from being able to
> >>> exploit
> >>>> numpy.ma.concatenate in a way that saved some complexity in
> >>> implementation.
> >>>>>>>>>>>>> Since Arrow itself supports masking values with a bitfield,
> >>> is
> >>>> there something intrinsic to the notion of array masks that is not
> well
> >>>> supported? Or do you just mean the specific numpy MaskedArray class?
> >>>>>>>>>>>> I mean just the numpy.ma module. Not many Python computing
> >>>> projects
> >>>>>>>>>>>> nowadays treat MaskedArray objects as first class citizens.
> >>>> Depending
> >>>>>>>>>>>> on what you need it may or may not be a problem. pyarrow
> >>> supports
> >>>>>>>>>>>> ingesting from MaskedArray as a convenience, but it would not
> >>> be
> >>>>>>>>>>>> common in my experience for a library's APIs to return
> >>>> MaskedArrays.
> >>>>>>>>>>>>> If this is too much of a numpy question rather than an arrow
> >>>> question, could you point me to where I can read up on masked array
> >>> support
> >>>> or maybe what the right place to ask the numpy community about whether
> >>> what
> >>>> I'm doing is appropriate or not.
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Dan Nugent
>

Re: Attn: Wes, Re: Masked Arrays

Posted by Felix Benning <fe...@gmail.com>.
I guess it would be helpful, when trying to achieve zero-modification
between R and another language, if the standard used for communication
allowed for that, or when setting all nulls to zero for an algorithm
and then saving the result to a database for later use. But I have only
known about this project for a couple of days, so my opinion on this is
likely uneducated at best. I am mostly curious about how this pans out
and what the trade-offs are; since I went down this rabbit hole of "how
to handle nulls" this far, I might as well bottom out ;-).

- Felix

Re: Attn: Wes, Re: Masked Arrays

Posted by Wes McKinney <we...@gmail.com>.
For the sake of others reading, this discussion might be a bit
confusing to happen upon because the scope isn't clear. It seems that
we are discussing the C++ implementation and not the columnar format,
is that right?

Adding any additional metadata about this to the columnar format /
Flatbuffers files / C interface is probably a non-starter. We've
discussed the contents of data "underneath" a null and consistently
the consensus is that it is unspecified.

Applications (as well as internal details of some implementations and
their interactions with external libraries) are free to set
custom_metadata fields in schemas to indicate otherwise. However, one
must take care to not propagate this metadata inappropriately from one
realization of a schema (as an Array or RecordBatch) where it is true
to another where it is not true. Similarly, one should also be careful
not to use such metadata on data whose provenance is unknown.

- Wes

On Mon, Apr 6, 2020 at 11:37 AM Felix Benning <fe...@gmail.com> wrote:
>
> In that case it is probably necessary to have a "has_sentinel" flag and a
> "sentinel_value" variable. Since other algorithms might benefit from not
> having to set these values to zero. Which is probably the reason why the
> value "underneath" was set to unspecified in the first place. Alternatively
> a "sentinel_enum" could specify whether the sentinel is 0, or the R
> sentinel value is used. This would sacrifice flexibility for size. Although
> size probably does not matter, when meta data for entire columns are
> concerned. So the first approach is probably better.
>
> Felix
>
> On Mon, 6 Apr 2020 at 17:59, Francois Saint-Jacques <fs...@gmail.com>
> wrote:
>
> > It does make sense, I would go a little further and make this
> > field/property a single value of the same type than the array. This
> > would allow using any arbitrary sentinel value for unknown values (0
> > in your suggested case). The end result is zero-copy for R bindings
> > (if stars are aligned). I created ARROW-8348 [1] for this.
> >
> > François
> >
> > [1] https://jira.apache.org/jira/browse/ARROW-8348
> >
> > On Mon, Apr 6, 2020 at 11:02 AM Felix Benning <fe...@gmail.com>
> > wrote:
> > >
> > > Would it make sense to have an `na_are_zero` flag? Since null checking is
> > > not without cost, it might be helpful to some algorithms, if the content
> > > "underneath" the nulls is zero. For example in means, or scalar products
> > > and thus matrix multiplication, knowing that the array has zeros where
> > the
> > > na's are, would allow these algorithms to pretend that there are no na's.
> > > Since setting all nulls to zero in a matrix of n columns and n rows costs
> > > O(n^2), it would make sense to set them all to zero before matrix
> > > multiplication i.e. O(n^3) and similarly expensive algorithms. If there
> > was
> > > a `na_are_zero` flag, other algorithms could later utilize this work
> > > already being done. Algorithms which change the data and violate this
> > > contract, would only need to reset the flag. And in some use cases, it
> > > might be possible to use idle time of the computer to "clean up" the
> > na's,
> > > preparing for the next query.
> > >
> > > Felix
> > >
> > > ---------- Forwarded message ---------
> > > From: Wes McKinney <we...@gmail.com>
> > > Date: Sun, 5 Apr 2020 at 22:31
> > > Subject: Re: Attn: Wes, Re: Masked Arrays
> > > To: <us...@arrow.apache.org>
> > >
> > >
> > > As I recall the contents "underneath" have been discussed before and
> > > the consensus was that the contents are not specified. If you'e like
> > > to make a proposal to change something I would suggest raising it on
> > > dev@arrow.apache.org
> > >
> > > On Sun, Apr 5, 2020 at 1:56 PM Felix Benning <fe...@gmail.com>
> > > wrote:
> > > >
> > > > Follow up: Do you think it would make sense to have an `na_are_zero`
> > > flag? Since it appears that the baseline (naively assuming there are no
> > > null values) is still a bit faster than equally optimized null value
> > > handling algorithms. So you might want to make the assumption, that all
> > > null values are set to zero in the array (instead of undefined). This
> > would
> > > allow for very fast means, scalar products and thus matrix multiplication
> > > which ignore nas. And in case of matrix multiplication, you might prefer
> > > sacrificing an O(n^2) effort to set all null entries to zero before
> > > multiplying. And assuming you do not overwrite this data, you would be
> > able
> > > to reuse that assumption in later computations with such a flag.
> > > > In some use cases, you might even be able to utilize unused computing
> > > resources for this task. I.e. clean up the nulls while the computer is
> > not
> > > used, preparing for the next query.
> > > >
> > > >
> > > > On Sun, 5 Apr 2020 at 18:34, Felix Benning <fe...@gmail.com>
> > > wrote:
> > > >>
> > > >> Awesome, that was exactly what I was looking for, thank you!
> > > >>
> > > >> On Sun, 5 Apr 2020 at 00:40, Wes McKinney <we...@gmail.com>
> > wrote:
> > > >>>
> > > >>> I wrote a blog post a couple of years ago about this
> > > >>>
> > > >>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > > >>>
> > > >>> Pasha Stetsenko did a follow-up analysis that showed that my
> > > >>> "sentinel" code could be significantly improved, see:
> > > >>>
> > > >>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
> > > >>>
> > > >>> Generally speaking in Apache Arrow we've been happy to have a uniform
> > > >>> representation of nullness across all types, both primitive
> > (booleans,
> > > >>> numbers, or strings) and nested (lists, structs, unions, etc.). Many
> > > >>> computational operations (like elementwise functions) need not
> > concern
> > > >>> themselves with the nulls at all, for example, since the bitmap from
> > > >>> the input array can be passed along (with zero copy even) to the
> > > >>> output array.
> > > >>>
> > > >>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <
> > felix.benning@gmail.com>
> > > wrote:
> > > >>> >
> > > >>> > Does anyone have an opinion (or links) about Bitpattern vs Masked
> > > Arrays for NA implementations? There seems to have been a discussion
> > about
> > > that in the numpy community in 2012
> > > https://numpy.org/neps/nep-0026-missing-data-summary.html without an
> > > apparent result.
> > > >>> >
> > > >>> > Summary of the Summary:
> > > >>> > - The Bitpattern approach reserves one bitpattern of any type as
> > na,
> > > the only type not having spare bitpatterns are integers which means this
> > > decreases their range by one. This approach is taken by R and was
> > regarded
> > > as more performant in 2012.
> > > >>> > - The Mask approach was deemed more flexible, since it would allow
> > > "degrees of missingness", and also cleaner/easier implementation.
> > > >>> >
> > > >>> > Since bitpattern checks would probably disrupt SIMD, I feel like
> > some
> > > calculations (e.g. mean) would actually benefit more, from setting na
> > > values to zero, proceeding as if they were not there, and using the
> > number
> > > of nas in the metadata to adjust the result. This of course does not work
> > > if two columns are used (e.g. scalar product), which is probably more
> > > important.
> > > >>> >
> > > >>> > Was using Bitmasks in Arrow a conscious performance decision? Or
> > was
> > > the decision only based on the fact, that R and Bitpattern
> > implementations
> > > in general are a niche, which means that Bitmasks are more compatible
> > with
> > > other languages?
> > > >>> >
> > > >>> > I am curious about this topic, since the "lack of proper na
> > support"
> > > was cited as the reason, why Python would never replace R in statistics.
> > > >>> >
> > > >>> > Thanks,
> > > >>> >
> > > >>> > Felix
> > > >>> >
> > > >>> >
> > > >>> > On 31.03.20 14:52, Joris Van den Bossche wrote:
> > > >>> >
> > > >>> > Note that pandas is starting to use a notion of "masked arrays" as
> > > well, for example for its nullable integer data type, but also not using
> > > the np.ma masked array, but a custom implementation (for technical
> > reasons
> > > in pandas this was easier).
> > > >>> >
> > > >>> > Also, there has been quite some discussion last year in numpy
> > about a
> > > possible re-implementation of a MaskedArray, but using numpy's protocols
> > > (`__array_ufunc__`, `__array_function__` etc), instead of being a
> > subclass
> > > like np.ma now is. See eg
> > > https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html
> > .
> > > >>> >
> > > >>> > Joris
> > > >>> >
> > > >>> > On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <nu...@gmail.com>
> > wrote:
> > > >>> >>
> > > >>> >> Ok. That actually aligns closely to what I'm familiar with. Good
> > to
> > > know.
> > > >>> >>
> > > >>> >> Thanks again for taking the time to respond,
> > > >>> >>
> > > >>> >> -Dan Nugent
> > > >>> >>
> > > >>> >>
> > > >>> >> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <
> > wesmckinn@gmail.com>
> > > wrote:
> > > >>> >>>
> > > >>> >>> Social and technical reasons I guess. Empirically it's just not
> > > used much.
> > > >>> >>>
> > > >>> >>> You can see my comments about numpy.ma in my 2010 paper about
> > pandas
> > > >>> >>>
> > > >>> >>>
> > https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
> > > >>> >>>
> > > >>> >>> At least in 2010, there were notable performance problems when
> > using
> > > >>> >>> MaskedArray for computations
> > > >>> >>>
> > > >>> >>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
> > > >>> >>> performance reasons (which are beyond the scope of this paper),
> > as
> > > NaN
> > > >>> >>> propagates in floating-point operations in a natural way and can
> > be
> > > >>> >>> easily detected in algorithms."
> > > >>> >>>
> > > >>> >>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <nugend@gmail.com
> > >
> > > wrote:
> > > >>> >>> >
> > > >>> >>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll
> > > stick with it.
> > > >>> >>> >
> > > >>> >>> > Do you have any feelings about why Numpy's masked arrays didn't
> > > gain favor when many data representation formats explicitly support
> > nullity
> > > (including Arrow)? Is it just that not carrying nulls in computations
> > > forward is preferable (that is, early filtering/value filling was
> > easier)?
> > > >>> >>> >
> > > >>> >>> > -Dan Nugent
> > > >>> >>> >
> > > >>> >>> >
> > > >>> >>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <
> > wesmckinn@gmail.com>
> > > wrote:
> > > >>> >>> >>
> > > >>> >>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <
> > nugend@gmail.com>
> > > wrote:
> > > >>> >>> >> >
> > > >>> >>> >> > Didn’t want to follow up on this on the Jira issue earlier
> > > since it's sort of tangential to that bug and more of a usage question.
> > You
> > > said:
> > > >>> >>> >> >
> > > >>> >>> >> > > I wouldn't recommend building applications based on them
> > > nowadays since the level of support / compatibility in other projects is
> > > low.
> > > >>> >>> >> >
> > > >>> >>> >> > In my case, I am using them since it seemed like a
> > > straightforward representation of my data that has nulls, the format I’m
> > > converting from has zero cost numpy representations, and converting from
> > an
> > > internal format into Arrow in memory structures appears zero cost (or
> > close
> > > to it) as well. I guess I can just provide the mask as an explicit
> > > argument, but my original desire to use it came from being able to
> > exploit
> > > numpy.ma.concatenate in a way that saved some complexity in
> > implementation.
> > > >>> >>> >> >
> > > >>> >>> >> > Since Arrow itself supports masking values with a bitfield,
> > is
> > > there something intrinsic to the notion of array masks that is not well
> > > supported? Or do you just mean the specific numpy MaskedArray class?
> > > >>> >>> >> >
> > > >>> >>> >>
> > > >>> >>> >> I mean just the numpy.ma module. Not many Python computing
> > > projects
> > > >>> >>> >> nowadays treat MaskedArray objects as first class citizens.
> > > Depending
> > > >>> >>> >> on what you need it may or may not be a problem. pyarrow
> > supports
> > > >>> >>> >> ingesting from MaskedArray as a convenience, but it would not
> > be
> > > >>> >>> >> common in my experience for a library's APIs to return
> > > MaskedArrays.
> > > >>> >>> >>
> > > >>> >>> >> > If this is too much of a numpy question rather than an arrow
> > > question, could you point me to where I can read up on masked array
> > support
> > > or maybe what the right place to ask the numpy community about whether
> > what
> > > I'm doing is appropriate or not.
> > > >>> >>> >> >
> > > >>> >>> >> > Thanks,
> > > >>> >>> >> >
> > > >>> >>> >> >
> > > >>> >>> >> > -Dan Nugent
> >

Re: Attn: Wes, Re: Masked Arrays

Posted by Felix Benning <fe...@gmail.com>.
In that case it is probably necessary to have a "has_sentinel" flag and
a "sentinel_value" variable, since other algorithms might benefit from
not having to set these values to zero. That is probably why the value
"underneath" was left unspecified in the first place. Alternatively, a
"sentinel_enum" could specify whether the sentinel is 0 or the R
sentinel value is used. This would sacrifice flexibility for size,
although size hardly matters when the metadata describes an entire
column, so the first approach is probably better.

Felix

On Mon, 6 Apr 2020 at 17:59, Francois Saint-Jacques <fs...@gmail.com>
wrote:

> It does make sense, I would go a little further and make this
> field/property a single value of the same type as the array. This
> would allow using any arbitrary sentinel value for unknown values (0
> in your suggested case). The end result is zero-copy for R bindings
> (if stars are aligned). I created ARROW-8348 [1] for this.
>
> François
>
> [1] https://jira.apache.org/jira/browse/ARROW-8348

Re: Attn: Wes, Re: Masked Arrays

Posted by Francois Saint-Jacques <fs...@gmail.com>.
It does make sense. I would go a little further and make this
field/property a single value of the same type as the array. This
would allow using any arbitrary sentinel value for unknown values (0
in your suggested case). The end result is zero-copy for R bindings
(if the stars are aligned). I created ARROW-8348 [1] for this.

François

[1] https://jira.apache.org/jira/browse/ARROW-8348

On Mon, Apr 6, 2020 at 11:02 AM Felix Benning <fe...@gmail.com> wrote:
>
> Would it make sense to have an `na_are_zero` flag? Since null checking is
> not without cost, it might be helpful to some algorithms, if the content
> "underneath" the nulls is zero. For example in means, or scalar products
> and thus matrix multiplication, knowing that the array has zeros where the
> na's are, would allow these algorithms to pretend that there are no na's.
> Since setting all nulls to zero in a matrix of n columns and n rows costs
> O(n^2), it would make sense to set them all to zero before matrix
> multiplication i.e. O(n^3) and similarly expensive algorithms. If there was
> a `na_are_zero` flag, other algorithms could later utilize this work
> already being done. Algorithms which change the data and violate this
> contract, would only need to reset the flag. And in some use cases, it
> might be possible to use idle time of the computer to "clean up" the na's,
> preparing for the next query.
>
> Felix

Fwd: Attn: Wes, Re: Masked Arrays

Posted by Felix Benning <fe...@gmail.com>.
Would it make sense to have an `na_are_zero` flag? Since null checking
is not free, it could help some algorithms if the content "underneath"
the nulls is known to be zero. For means, scalar products, and thus
matrix multiplication, knowing that the array has zeros where the na's
are would allow these algorithms to pretend there are no na's. Setting
all nulls to zero in an n-by-n matrix costs O(n^2), so it is worth
doing before matrix multiplication, an O(n^3) operation, and before
similarly expensive algorithms. With an `na_are_zero` flag, later
algorithms could reuse the work already done. Algorithms that mutate
the data and violate this contract would only need to clear the flag.
In some use cases it might even be possible to use the computer's idle
time to "clean up" the na's, preparing for the next query.

Felix
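A numpy sketch of the idea, with illustrative names: `values` stands
for an Arrow values buffer whose null slots have already been zeroed,
and `null_count` for per-array metadata Arrow already keeps:

```python
import numpy as np

# Third slot is null; under the na_are_zero contract it holds 0.0.
values = np.array([1.0, 2.0, 0.0, 4.0])
valid = np.array([True, True, False, True])
null_count = int((~valid).sum())

# Mean ignoring na's: the hot loop is a plain vectorized sum with no
# per-element null branch; the null count only adjusts the divisor.
mean = values.sum() / (len(values) - null_count)
print(mean)  # 7/3

# A scalar product that treats na as "contributes nothing" also needs
# no branches: a zero under a null on either side drops out by itself.
other = np.array([10.0, 0.0, 5.0, 1.0])  # second slot is null
print(values @ other)  # 14.0
```

Note that this implements "skip the na's" semantics; it does not
propagate na through the computation the way R or NaN arithmetic would.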

---------- Forwarded message ---------
From: Wes McKinney <we...@gmail.com>
Date: Sun, 5 Apr 2020 at 22:31
Subject: Re: Attn: Wes, Re: Masked Arrays
To: <us...@arrow.apache.org>


As I recall the contents "underneath" have been discussed before and
the consensus was that the contents are not specified. If you'd like
to make a proposal to change something I would suggest raising it on
dev@arrow.apache.org

On Sun, Apr 5, 2020 at 1:56 PM Felix Benning <fe...@gmail.com>
wrote:
>
> Follow up: Do you think it would make sense to have an `na_are_zero`
flag? Since it appears that the baseline (naively assuming there are no
null values) is still a bit faster than equally optimized null value
handling algorithms. So you might want to make the assumption, that all
null values are set to zero in the array (instead of undefined). This would
allow for very fast means, scalar products and thus matrix multiplication
which ignore nas. And in case of matrix multiplication, you might prefer
sacrificing an O(n^2) effort to set all null entries to zero before
multiplying. And assuming you do not overwrite this data, you would be able
to reuse that assumption in later computations with such a flag.
> In some use cases, you might even be able to utilize unused computing
resources for this task. I.e. clean up the nulls while the computer is not
used, preparing for the next query.
>
>
> On Sun, 5 Apr 2020 at 18:34, Felix Benning <fe...@gmail.com>
wrote:
>>
>> Awesome, that was exactly what I was looking for, thank you!
>>
>> On Sun, 5 Apr 2020 at 00:40, Wes McKinney <we...@gmail.com> wrote:
>>>
>>> I wrote a blog post a couple of years ago about this
>>>
>>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>>>
>>> Pasha Stetsenko did a follow-up analysis that showed that my
>>> "sentinel" code could be significantly improved, see:
>>>
>>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
>>>
>>> Generally speaking in Apache Arrow we've been happy to have a uniform
>>> representation of nullness across all types, both primitive (booleans,
>>> numbers, or strings) and nested (lists, structs, unions, etc.). Many
>>> computational operations (like elementwise functions) need not concern
>>> themselves with the nulls at all, for example, since the bitmap from
>>> the input array can be passed along (with zero copy even) to the
>>> output array.
>>>
>>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <fe...@gmail.com>
wrote:
>>> >
>>> > Does anyone have an opinion (or links) about Bitpattern vs Masked
Arrays for NA implementations? There seems to have been a discussion about
that in the numpy community in 2012
https://numpy.org/neps/nep-0026-missing-data-summary.html without an
apparent result.
>>> >
>>> > Summary of the Summary:
>>> > - The Bitpattern approach reserves one bitpattern of each type as NA;
the only types without spare bitpatterns are the integers, which means this
decreases their range by one. This approach is taken by R and was regarded
as more performant in 2012.
>>> > - The Mask approach was deemed more flexible, since it would allow
"degrees of missingness", as well as a cleaner/easier implementation.
>>> >
>>> > Since bitpattern checks would probably disrupt SIMD, I feel like some
calculations (e.g. mean) would actually benefit more from setting na
values to zero, proceeding as if they were not there, and using the number
of nas in the metadata to adjust the result. This of course does not work
if two columns are used (e.g. scalar product), which is probably more
important.
>>> >
>>> > Was using Bitmasks in Arrow a conscious performance decision? Or was
the decision only based on the fact that R and Bitpattern implementations
in general are a niche, which means that Bitmasks are more compatible with
other languages?
>>> >
>>> > I am curious about this topic, since the "lack of proper na support"
was cited as the reason why Python would never replace R in statistics.
>>> >
>>> > Thanks,
>>> >
>>> > Felix
>>> >
>>> >
>>> > On 31.03.20 14:52, Joris Van den Bossche wrote:
>>> >
>>> > Note that pandas is starting to use a notion of "masked arrays" as
well, for example for its nullable integer data type, but also not using
the np.ma masked array, but a custom implementation (for technical reasons
in pandas this was easier).
>>> >
>>> > Also, there has been quite some discussion last year in numpy about a
possible re-implementation of a MaskedArray, but using numpy's protocols
(`__array_ufunc__`, `__array_function__` etc), instead of being a subclass
like np.ma now is. See eg
https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.
>>> >
>>> > Joris
>>> >
>>> > On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <nu...@gmail.com> wrote:
>>> >>
>>> >> Ok. That actually aligns closely to what I'm familiar with. Good to
know.
>>> >>
>>> >> Thanks again for taking the time to respond,
>>> >>
>>> >> -Dan Nugent
>>> >>
>>> >>
>>> >> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <we...@gmail.com>
wrote:
>>> >>>
>>> >>> Social and technical reasons I guess. Empirically it's just not
used much.
>>> >>>
>>> >>> You can see my comments about numpy.ma in my 2010 paper about pandas
>>> >>>
>>> >>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
>>> >>>
>>> >>> At least in 2010, there were notable performance problems when using
>>> >>> MaskedArray for computations
>>> >>>
>>> >>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
>>> >>> performance reasons (which are beyond the scope of this paper), as
NaN
>>> >>> propagates in floating-point operations in a natural way and can be
>>> >>> easily detected in algorithms."
>>> >>>
>>> >>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <nu...@gmail.com>
wrote:
>>> >>> >
>>> >>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll
stick with it.
>>> >>> >
>>> >>> > Do you have any feelings about why Numpy's masked arrays didn't
gain favor when many data representation formats explicitly support nullity
(including Arrow)? Is it just that not carrying nulls in computations
forward is preferable (that is, early filtering/value filling was easier)?
>>> >>> >
>>> >>> > -Dan Nugent
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <we...@gmail.com>
wrote:
>>> >>> >>
>>> >>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <nu...@gmail.com>
wrote:
>>> >>> >> >
>>> >>> >> > Didn’t want to follow up on this on the Jira issue earlier
since it's sort of tangential to that bug and more of a usage question. You
said:
>>> >>> >> >
>>> >>> >> > > I wouldn't recommend building applications based on them
nowadays since the level of support / compatibility in other projects is
low.
>>> >>> >> >
>>> >>> >> > In my case, I am using them since it seemed like a
straightforward representation of my data that has nulls, the format I’m
converting from has zero cost numpy representations, and converting from an
internal format into Arrow in memory structures appears zero cost (or close
to it) as well. I guess I can just provide the mask as an explicit
argument, but my original desire to use it came from being able to exploit
numpy.ma.concatenate in a way that saved some complexity in implementation.
>>> >>> >> >
>>> >>> >> > Since Arrow itself supports masking values with a bitfield, is
there something intrinsic to the notion of array masks that is not well
supported? Or do you just mean the specific numpy MaskedArray class?
>>> >>> >> >
>>> >>> >>
>>> >>> >> I mean just the numpy.ma module. Not many Python computing
projects
>>> >>> >> nowadays treat MaskedArray objects as first class citizens.
Depending
>>> >>> >> on what you need it may or may not be a problem. pyarrow supports
>>> >>> >> ingesting from MaskedArray as a convenience, but it would not be
>>> >>> >> common in my experience for a library's APIs to return
MaskedArrays.
>>> >>> >>
>>> >>> >> > If this is too much of a numpy question rather than an arrow
question, could you point me to where I can read up on masked array support
or maybe what the right place to ask the numpy community about whether what
I'm doing is appropriate or not.
>>> >>> >> >
>>> >>> >> > Thanks,
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > -Dan Nugent

Re: Attn: Wes, Re: Masked Arrays

Posted by Wes McKinney <we...@gmail.com>.
As I recall the contents "underneath" have been discussed before and
the consensus was that the contents are not specified. If you'd like
to make a proposal to change something I would suggest raising it on
dev@arrow.apache.org


Re: Attn: Wes, Re: Masked Arrays

Posted by Felix Benning <fe...@gmail.com>.
Follow up: Do you think it would make sense to have an `na_are_zero` flag?
Since it appears that the baseline (naively assuming there are no null
values) is still a bit faster than equally optimized null-handling
algorithms, you might want to assume that all null values are set to zero
in the array (instead of being undefined). This would allow very fast
means, scalar products, and thus matrix multiplication that ignore NAs.
In the case of matrix multiplication, you might prefer to spend an O(n^2)
effort setting all null entries to zero before multiplying. And assuming
you do not overwrite this data, such a flag would let you reuse that
assumption in later computations.
In some use cases, you might even be able to use idle computing resources
for this task, i.e. clean up the nulls while the computer is not in use,
preparing for the next query.



Re: Attn: Wes, Re: Masked Arrays

Posted by Felix Benning <fe...@gmail.com>.
Awesome, that was exactly what I was looking for, thank you!


Re: Attn: Wes, Re: Masked Arrays

Posted by Wes McKinney <we...@gmail.com>.
I wrote a blog post a couple of years ago about this

https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/

Pasha Stetsenko did a follow-up analysis that showed that my
"sentinel" code could be significantly improved, see:

https://github.com/st-pasha/microbench-nas/blob/master/README.md

Generally speaking, in Apache Arrow we've been happy to have a uniform
representation of nullness across all types, both primitive (booleans,
numbers, or strings) and nested (lists, structs, unions, etc.). Many
computational operations (like elementwise functions) need not concern
themselves with the nulls at all, for example, since the bitmap from
the input array can be passed along (with zero copy even) to the
output array.


Re: Attn: Wes, Re: Masked Arrays

Posted by Felix Benning <fe...@gmail.com>.
Does anyone have an opinion (or links) about Bitpattern vs Masked Arrays 
for NA implementations? There seems to have been a discussion about that 
in the numpy community in 2012 
https://numpy.org/neps/nep-0026-missing-data-summary.html without an 
apparent result.

Summary of the Summary:
- The Bitpattern approach reserves one bit pattern of each type as NA; 
the only type without spare bit patterns is integers, so reserving one 
decreases their range by one. This approach is taken by R and was 
regarded as more performant in 2012.
- The Mask approach was deemed more flexible, since it would allow 
"degrees of missingness", and also cleaner and easier to implement.

Since bit-pattern checks would probably disrupt SIMD, I feel like some 
calculations (e.g. the mean) would actually benefit more from setting 
NA values to zero, proceeding as if they were not there, and using the 
number of NAs in the metadata to adjust the result. This of course 
does not work when two columns are combined (e.g. a scalar product), 
which is probably the more important case.
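
A quick numpy sketch of that mean idea (my own illustration, assuming 
the NA slots are already zeroed and the NA count is tracked separately):

```python
import numpy as np

values = np.array([1.0, 0.0, 3.0])     # the NA slot has been zeroed out
mask = np.array([False, True, False])  # True marks NA
n_na = int(mask.sum())                 # would live in the metadata

# The sum is unaffected by the zeroed NA slots, so a plain SIMD-friendly
# reduction works; only the divisor needs adjusting.
mean = values.sum() / (len(values) - n_na)
print(mean)
```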

Was using bitmasks in Arrow a conscious performance decision? Or was 
the decision based only on the fact that R and bit-pattern 
implementations are in general a niche, which means that bitmasks are 
more compatible with other languages?

I am curious about this topic, since the "lack of proper NA support" 
was cited as the reason why Python would never replace R in statistics.

Thanks,

Felix



Re: Attn: Wes, Re: Masked Arrays

Posted by Joris Van den Bossche <jo...@gmail.com>.
Note that pandas is starting to use a notion of "masked arrays" as well,
for example for its nullable integer data type, though not using the np.ma
masked array but a custom implementation (for technical reasons this was
easier in pandas).

Also, there has been quite some discussion last year in numpy about a
possible re-implementation of a MaskedArray, but using numpy's protocols
(`__array_ufunc__`, `__array_function__` etc), instead of being a subclass
like np.ma now is. See eg
https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.
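
As a rough illustration of the protocol-based approach (a toy sketch of 
my own, not the implementation discussed in that thread):

```python
import numpy as np

class TinyMasked:
    """Toy protocol-based masked array: composition, not an ndarray subclass."""
    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask)   # True marks a missing element

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Apply the ufunc to the underlying data and OR the masks together,
        # so missingness propagates through the operation.
        datas = [x.data if isinstance(x, TinyMasked) else x for x in inputs]
        masks = [x.mask for x in inputs if isinstance(x, TinyMasked)]
        result = getattr(ufunc, method)(*datas, **kwargs)
        return TinyMasked(result, np.logical_or.reduce(masks))

a = TinyMasked([1.0, 2.0], [False, True])
b = np.add(a, 1.0)   # numpy dispatches to TinyMasked.__array_ufunc__
print(b.data, b.mask)
```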

Joris


Re: Attn: Wes, Re: Masked Arrays

Posted by Daniel Nugent <nu...@gmail.com>.
Ok. That actually aligns closely to what I'm familiar with. Good to know.

Thanks again for taking the time to respond,

-Dan Nugent



Re: Attn: Wes, Re: Masked Arrays

Posted by Wes McKinney <we...@gmail.com>.
Social and technical reasons I guess. Empirically it's just not used much.

You can see my comments about numpy.ma in my 2010 paper about pandas

https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf

At least in 2010, there were notable performance problems when using
MaskedArray for computations:

"We chose to use NaN as opposed to using NumPy MaskedArrays for
performance reasons (which are beyond the scope of this paper), as NaN
propagates in floating-point operations in a natural way and can be
easily detected in algorithms."
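
For illustration, the propagate-then-detect behaviour looks like this 
in plain Python:

```python
import math

x = [1.0, float("nan"), 3.0]

# NaN propagates through arithmetic, so missingness survives the op...
y = [v + 1.0 for v in x]

# ...and can be detected afterwards without any separate mask structure.
missing = [math.isnan(v) for v in y]
print(missing)
```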


Re: Attn: Wes, Re: Masked Arrays

Posted by Daniel Nugent <nu...@gmail.com>.
Thanks! Since I'm just using it to jump to Arrow, I think I'll stick with
it.

Do you have any feelings about why NumPy's masked arrays didn't gain favor
when many data representation formats (including Arrow) explicitly support
nullity? Is it just that not carrying nulls forward through computations is
preferable (that is, early filtering/value filling was easier)?

-Dan Nugent



Re: Attn: Wes, Re: Masked Arrays

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <nu...@gmail.com> wrote:
>
> Didn’t want to follow up on this on the Jira issue earlier since it's sort of tangential to that bug and more of a usage question. You said:
>
> > I wouldn't recommend building applications based on them nowadays since the level of support / compatibility in other projects is low.
>
> In my case, I am using them since it seemed like a straightforward representation of my data that has nulls, the format I’m converting from has zero cost numpy representations, and converting from an internal format into Arrow in memory structures appears zero cost (or close to it) as well. I guess I can just provide the mask as an explicit argument, but my original desire to use it came from being able to exploit numpy.ma.concatenate in a way that saved some complexity in implementation.
>
> Since Arrow itself supports masking values with a bitfield, is there something intrinsic to the notion of array masks that is not well supported? Or do you just mean the specific numpy MaskedArray class?
>

I mean just the numpy.ma module. Not many Python computing projects
nowadays treat MaskedArray objects as first class citizens. Depending
on what you need it may or may not be a problem. pyarrow supports
ingesting from MaskedArray as a convenience, but it would not be
common in my experience for a library's APIs to return MaskedArrays.
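
As a small sketch of both routes (numpy only; the pyarrow calls are 
shown as comments since they assume pyarrow is installed):

```python
import numpy as np
import numpy.ma as ma

arr = ma.masked_array([1, 2, 3], mask=[False, True, False])

# Route 1: hand the MaskedArray to pyarrow directly (the convenience path):
#     pa.array(arr)
# Route 2: pass the value buffer and a boolean mask explicitly
# (True marks a null):
#     pa.array(np.asarray(arr.data), mask=np.asarray(arr.mask))
values = np.asarray(arr.data)
mask = np.asarray(arr.mask)
print(values.tolist(), mask.tolist())
```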

> If this is too much of a numpy question rather than an arrow question, could you point me to where I can read up on masked array support or maybe what the right place to ask the numpy community about whether what I'm doing is appropriate or not.
>
> Thanks,
>
>
> -Dan Nugent

Re: Attn: Wes, Re: Masked Arrays

Posted by Daniel Nugent <nu...@gmail.com>.
Shoot, sorry, there's a typo in there:

> converting from an internal format into Arrow in memory structures
appears zero cos

should be

> converting from numpy arrays into Arrow in memory structures appears zero
cost

-Dan Nugent

