Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2019/04/23 17:45:00 UTC

[jira] [Closed] (MADLIB-1303) Add 1-hot encoding to dependent variable in mini-batch preprocessor for images

     [ https://issues.apache.org/jira/browse/MADLIB-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan closed MADLIB-1303.
-----------------------------------
    Resolution: Fixed

https://github.com/apache/madlib/pull/360

> Add 1-hot encoding to dependent variable in mini-batch preprocessor for images
> ------------------------------------------------------------------------------
>
>                 Key: MADLIB-1303
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1303
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v1.16
>
>
> Story
> As a data scientist, I want the mini-batch preprocessor to 1-hot encode the dependent variable so that I don't need to do it myself. This applies to all scalar types: boolean, character types (text, char and varchar), integers and floats.
> If the dependent variable is already an array, we assume it is already 1-hot encoded, so we just cast it to int[] and pass it along.
> We can remove the param `dependent_offset (optional)` from the current interface since 1-hot encoding is the more general solution.
> Open questions
> 1) Q: Can we use exactly the same 1-hot encoding as in
> http://madlib.apache.org/docs/latest/group__grp__minibatch__preprocessing.html
> i.e., add the param `one_hot_encode_int_dep_var (optional)`? Then we could reuse code that is already written and tested.
> A: We can reuse the code to the extent possible, but we do not need this param.
> 2) Q: If the dependent variable is already 1-hot encoded, we need to support array input for the dependent variable. Should we just pass it through, or check that the array contains only 1's and 0's?
> A: We will check the first row, but that does not guarantee that all rows are correct.
> 3) Q: How should we handle floats? If a user wants to encode float values for some reason, they could cast them to text first. Or should we just pass them along?
> A: If scalar float, we 1-hot encode (could be a valid case). If float[], we cast to int[].
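
The rules agreed on in the answers above can be sketched in a few lines of Python. This is only an illustration of the intended behavior, not the MADlib implementation: the function name `one_hot_encode_dep_var` is hypothetical, the column order (sorted distinct values) is an assumption, and the first-row check for 0's and 1's is one reading of answer 2.

```python
def one_hot_encode_dep_var(values):
    """Hypothetical sketch of the dependent-variable handling described above.

    Returns (encoded_rows, classes). `classes` is None when the input was
    already an array and was only cast to integers.
    """
    values = list(values)
    if not values:
        return [], []

    if isinstance(values[0], (list, tuple)):
        # Array input: assumed to be 1-hot encoded already.
        # Only the first row is checked, so later rows may still be invalid.
        if any(v not in (0, 1) for v in values[0]):
            raise ValueError("first row is not a 0/1 array")
        # Equivalent of casting float[] (or any numeric array) to int[].
        return [[int(v) for v in row] for row in values], None

    # Scalar input (boolean, text, integer or float): 1-hot encode.
    # Assumption: columns follow the sorted order of the distinct values.
    classes = sorted(set(values))
    position = {c: i for i, c in enumerate(classes)}
    encoded = [
        [1 if position[v] == i else 0 for i in range(len(classes))]
        for v in values
    ]
    return encoded, classes


# Text labels are encoded against the sorted distinct values.
print(one_hot_encode_dep_var(["cat", "dog", "cat"]))
# A float array is passed along after casting to integers.
print(one_hot_encode_dep_var([[0.0, 1.0], [1.0, 0.0]]))
```

Scalar floats go through the same path as text and integers, so each distinct float value becomes its own class, which matches answer 3.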



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)