Posted to issues@arrow.apache.org by "kevingurney (via GitHub)" <gi...@apache.org> on 2023/05/18 18:23:57 UTC

[GitHub] [arrow] kevingurney opened a new issue, #35676: [MATLAB] Add name-value pairs for controlling null value handling during construction of `arrow.array.Array`

kevingurney opened a new issue, #35676:
URL: https://github.com/apache/arrow/issues/35676

   ### Describe the enhancement requested
   
   This is a follow up to the initial null value handling support that was added in #35598.
   
   In order to give clients more flexibility in how null values in MATLAB arrays are detected when constructing an `arrow.array.Array`, it would be helpful to expose a few name-value pairs on the `arrow.array.Array` class (and concrete subclasses).
   
   **Two possible name-value pairs for handling null value detection when constructing an `arrow.array.Array` are described below.**
   
   ## `DetectNulls`
   
   **Supported values**: `true | false`
   
   `true` - "automatically" detect null values in the input MATLAB array based on the default value (if any) of `NullDetectionFcn`. For example, for `arrow.array.Float64Array`, `DetectNulls` would default to `true` and `NullDetectionFcn` would default to `@isnan`. This would mean that any `NaN` values in the input MATLAB `double` array would be treated as null values when constructing an `arrow.array.Float64Array`.
   
   `false` - Do not "automatically" detect null values. For some types (e.g. `arrow.array.ListArray`), if `DetectNulls = false` and the input MATLAB array contains nonconvertible values (e.g. `<missing>`), then an error would be thrown. We are still thinking through the design for how users can work around this case.
   
   **Example:**
   
   ```matlab
   >> matlabArray = string(["A", missing, "C", missing])'
   
   matlabArray = 
   
     4x1 string array
   
       "A"
       <missing>
       "C"
       <missing>
   
   % Defaults to treating <missing> as null
   >> arrowArray = arrow.array.StringArray(matlabArray, DetectNulls=true)
   [
       "A",
       null,
       "C",
       null
   ]
   ```
   
   **Note**: it most likely makes sense for different `arrow.array.Array` subclasses to have different default values for `DetectNulls` and `NullDetectionFcn`. For example, it doesn't make sense to set `DetectNulls=true` by default for `arrow.array.Int8Array` since there is no concept of "null-ability" or "missing-ness" for MATLAB integer types. On the other hand, `ismissing(double([1, NaN, 3]))` in MATLAB returns `logical([0, 1, 0])` because `NaN` is treated as a "missing" value. See https://www.mathworks.com/help/matlab/data_analysis/missing-data-in-matlab.html for more information.
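   
   As a rough illustration of why the defaults might differ per type, the sketch below uses only the built-in `ismissing` function (no Arrow calls). The variable names are illustrative and not part of the proposal.
   
   ```matlab
   % Floating-point arrays have a built-in "missing" value (NaN).
   doubleMask = ismissing(double([1, NaN, 3]));   % logical [0 1 0]
   
   % Integer arrays cannot hold NaN or <missing>, so nothing is ever flagged.
   int8Mask = ismissing(int8([1, 2, 3]));         % logical [0 0 0]
   
   % String arrays use <missing> as their missing value.
   stringMask = ismissing(["A", missing, "C"]);   % logical [0 1 0]
   ```
   
   Because the integer mask is always all-false, automatic detection would have nothing to find for types such as `arrow.array.Int8Array`.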
   
   ## `NullDetectionFcn`
   
   **Supported values**: `function_handle` that takes one input (a vector) and returns a `logical` vector
   
   A `function_handle` used for "detecting" values that should be treated as null when constructing an `arrow.array.Array`. For example, when set to `@isnan`, all `NaN` values in an input MATLAB `double` array would be treated as null when constructing an `arrow.array.Float64Array`.
   
   **Example:**
   
   ```matlab
   >> matlabArray = string(["A", "B", "INVALID", "D", "ERROR", "F"])'
   
   matlabArray = 
   
     6x1 string array
   
       "A"
       "B"
       "INVALID"
       "D"
       "ERROR"
       "F"
   
   >> nullDetectionFcn = @(s) strcmp(s, "INVALID") | strcmp(s, "ERROR")
   
   nullDetectionFcn =
   
     function_handle with value:
   
       @(s)strcmp(s,"INVALID")|strcmp(s,"ERROR")
   
   % Detects any strings with the values "INVALID" or "ERROR" as null values
   >> arrowArray = arrow.array.StringArray(matlabArray, NullDetectionFcn=nullDetectionFcn)
   [
       "A",
       "B",
       null,
       "D",
       null,
       "F"
   ]
   ```
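   
   A handle used this way has to operate element-wise on the whole input vector and return a `logical` array of the same size. The following is a minimal sketch in plain MATLAB (no Arrow calls) of what such a handle would return for the array above; `isNull` and `valid` are illustrative names, and `ismember` is just one possible way to write the test.
   
   ```matlab
   matlabArray = ["A"; "B"; "INVALID"; "D"; "ERROR"; "F"];
   
   % ismember tests each element against the set of sentinel strings
   % and returns a logical array with the same size as the input.
   nullDetectionFcn = @(s) ismember(s, ["INVALID", "ERROR"]);
   
   isNull = nullDetectionFcn(matlabArray);   % true where a value should become null
   valid  = ~isNull;                         % true where a value should be kept
   ```
   
   Note that the short-circuit operator `||` requires scalar operands, so a handle applied to an entire vector needs an element-wise form such as `|` or `ismember`.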
   
   ---
   
   ### Component(s)
   
   MATLAB


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou closed issue #35676: [MATLAB] Add an `InferNulls` name-value pair for controlling null value inference during construction of `arrow.array.Array`

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou closed issue #35676: [MATLAB] Add an `InferNulls` name-value pair for controlling null value inference during construction of `arrow.array.Array`
URL: https://github.com/apache/arrow/issues/35676




[GitHub] [arrow] sgilmore10 commented on issue #35676: [MATLAB] Add an `InferNulls` name-value pair for controlling null value inference during construction of `arrow.array.Array`

Posted by "sgilmore10 (via GitHub)" <gi...@apache.org>.
sgilmore10 commented on issue #35676:
URL: https://github.com/apache/arrow/issues/35676#issuecomment-1561308963

   take
   




[GitHub] [arrow] kevingurney commented on issue #35676: [MATLAB] Add an `InferNulls` name-value pair for controlling null value inference during construction of `arrow.array.Array`

Posted by "kevingurney (via GitHub)" <gi...@apache.org>.
kevingurney commented on issue #35676:
URL: https://github.com/apache/arrow/issues/35676#issuecomment-1561306093

   After some further consideration, it likely makes sense to simplify the proposed name-value pairs to only include `InferNulls = true | false` rather than `DetectNulls` and `NullDetectionFcn`.
   
   Rather than using a `function_handle`, clients can pre-compute null values using whatever approach they would like and then pass in a validity bitmap via the `Valid` name-value pair proposed in #35693.
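   
   Pre-computing the null positions might look roughly like the sketch below. It uses only built-in MATLAB functions; the sentinel value `-999` and the variable names are made up for illustration, and the final constructor call is left commented out because the exact form of the `Valid` name-value pair is an assumption based on the proposal in #35693.
   
   ```matlab
   data = [1.5; -999; 3.25; NaN; 5];
   
   % Treat both NaN and the sentinel value -999 as null.
   isNull = isnan(data) | data == -999;
   valid  = ~isNull;     % logical [1; 0; 1; 0; 1]
   
   % Hypothetical call, assuming Valid accepts a logical vector
   % with one element per input value:
   % arrowArray = arrow.array.Float64Array(data, Valid=valid);
   ```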
   
   I've updated the issue title and description accordingly.

