Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2019/04/03 17:30:00 UTC

[jira] [Closed] (MADLIB-1314) Add optional num_classes param for minibatch preprocessor for DL

     [ https://issues.apache.org/jira/browse/MADLIB-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan closed MADLIB-1314.
-----------------------------------
    Resolution: Fixed

> Add optional num_classes param for minibatch preprocessor for DL
> ----------------------------------------------------------------
>
>                 Key: MADLIB-1314
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1314
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Deep Learning, Module: Utilities
>            Reporter: Nandish Jayaram
>            Priority: Major
>             Fix For: v1.16
>
>
> The current `minibatch_preprocessor_dl` module inspects the input table to find the number of distinct categories (class values) of the dependent variable, and uses that number as the size of the one-hot encoded array. This can cause a failure in the madlib_keras fit function if the `num_classes` defined in the model architecture differs from (for example, is greater than) the size of the one-hot encoded array.
> This could be a fairly common scenario. For example:
> Say the original data set is places 350, but we decide to sample a subset. That subset may not contain all 350 classes (assume it has only 10), but the model we have already defined is for places 350, so `num_classes` there is specified as 350 and the final layer has that many units. Without this feature, which creates a one-hot encoded vector of size 350 even though only 10 class values are found in the input dataset, we would have to change the model architecture to work with the sampled dataset.
> Acceptance:
> 1. Add an optional `num_classes` param of type integer.
> 2. The one-hot encoded array must be of size `num_classes` if specified; otherwise its size is the number of distinct class values.
> 3. Fail if `num_classes` is less than the number of distinct class values found in the dataset.
> 4. The `class_values` column in the summary table must have `NULL` as the entry for class values that do not exist in the input table.
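
The acceptance criteria above can be sketched in a few lines of Python. This is only an illustration of the intended behavior, not MADlib's implementation: the function name `one_hot_encode`, its return shape, and the choice to sort class values and pad at the end are all assumptions made for the example, and `None` stands in for SQL `NULL`.

```python
def one_hot_encode(labels, num_classes=None):
    """Hypothetical helper illustrating the acceptance criteria.

    Returns (encoded_rows, class_values), where class_values plays the
    role of the summary table's `class_values` column.
    """
    # Distinct class values actually present in the input, in sorted order.
    found = sorted(set(labels))

    # Criterion 2: default to the number of distinct class values.
    if num_classes is None:
        num_classes = len(found)

    # Criterion 3: fail if num_classes is too small for the data.
    if num_classes < len(found):
        raise ValueError(
            "num_classes (%d) is less than the %d distinct class values found"
            % (num_classes, len(found))
        )

    # Criterion 4: pad with None (NULL) for classes absent from the input.
    class_values = found + [None] * (num_classes - len(found))

    # Criterion 2: every encoded row has exactly num_classes entries.
    position = {value: i for i, value in enumerate(found)}
    encoded_rows = []
    for label in labels:
        row = [0] * num_classes
        row[position[label]] = 1
        encoded_rows.append(row)
    return encoded_rows, class_values


# A sampled subset with 2 classes, encoded for a model defined with 4.
rows, values = one_hot_encode(["cat", "dog", "cat"], num_classes=4)
```

In this sketch the sampled data contains only two classes, yet each encoded row has four entries, so a final layer with four units would accept it without any change to the model architecture.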



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)