Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2019/02/07 19:31:00 UTC

[jira] [Issue Comment Deleted] (MADLIB-1294) Clarify dep and indep var column names in output table for minibatch preprocessor

     [ https://issues.apache.org/jira/browse/MADLIB-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1294:
------------------------------------
    Comment: was deleted

(was: Hmm, I guess I don't see why using "independent_varname" and "dependent_varname" as generic output column names is an issue.  If that is the known behavior and it is well documented in the user docs, is that not OK?

As we support more options in the future for specifying the independent variables in the input table (e.g., multiple columns, each holding scalar values), a generic output column name still works, whereas trying to carry over some version of the input column names gets complex.)

> Clarify dep and indep var column names in output table for minibatch preprocessor
> ---------------------------------------------------------------------------------
>
>                 Key: MADLIB-1294
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1294
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Domino Valdano
>            Priority: Minor
>             Fix For: v2.0
>
>
> The minibatch preprocessor utility, used for preparing input tables before training, accepts "independent_varname" and "dependent_varname" as parameters.
> I believe the original intention was to have these refer to the names of the columns in the input table as well as in the output table generated from it. However, there is a bug in the implementation: the output table columns should have been written as {independent_varname} and {dependent_varname}, but the curly braces were omitted, so whatever names were in the original table get wiped out and replaced by the literal strings 'independent_varname' and 'dependent_varname'.
> This makes little sense for several reasons:
> 1.) The contents of these columns are data, not variable names, so they end up misnamed in the output.
> 2.) This forces you to pass the argument strings 'independent_varname' and 'dependent_varname' as the column names of the resulting batched table to the fit/train function it's going to be fed into.  In other words, if you're using the minibatch preprocessor, then these arguments to fit/train serve no purpose, since you always have to pass the same strings rather than a custom name.
> 3.) You can't pick your own names for these variables, unless you want to manually rename them every time after you run the minibatch preprocessor.
> We have just finished building a similar minibatch preprocessing utility for deep learning support in MADlib 1.16. I'd like to avoid reproducing this bug in the new utility, but we don't want the two to be incompatible, so we need to fix either both the old and the new utility or neither. The only issue with fixing the old one is that it has already been released that way. So I'm opening this bug report to solicit community feedback on the issue.
> If there is anyone who knows of a reason why this should be viewed as a feature rather than a bug, or has a need for the functionality to remain the same going forward, please comment. Thanks!
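
For readers unfamiliar with how a missing pair of curly braces can produce this behavior, here is a minimal sketch in Python. It is not MADlib's actual source: the function names buggy_select and fixed_select, the table name "houses", and the simplified SELECT statement are all hypothetical, and the sketch assumes the parameters are plain column names rather than expressions.

```python
# Hypothetical illustration only; this is not the MADlib implementation.
# It assumes dependent_varname and independent_varname are plain column names.

def buggy_select(source_table, dependent_varname, independent_varname):
    # The aliases after AS lack curly braces, so str.format() leaves them
    # as the literal text 'dependent_varname' and 'independent_varname'.
    template = (
        "SELECT {dependent_varname} AS dependent_varname, "
        "{independent_varname} AS independent_varname "
        "FROM {source_table}"
    )
    return template.format(
        source_table=source_table,
        dependent_varname=dependent_varname,
        independent_varname=independent_varname,
    )


def fixed_select(source_table, dependent_varname, independent_varname):
    # With the braces in place, the aliases are substituted and the output
    # columns keep the names the caller passed in.
    template = (
        "SELECT {dependent_varname} AS {dependent_varname}, "
        "{independent_varname} AS {independent_varname} "
        "FROM {source_table}"
    )
    return template.format(
        source_table=source_table,
        dependent_varname=dependent_varname,
        independent_varname=independent_varname,
    )


if __name__ == "__main__":
    print(buggy_select("houses", "price", "features"))
    print(fixed_select("houses", "price", "features"))
```

In the first variant the output columns are always named 'dependent_varname' and 'independent_varname', whatever the caller passes. In the second they carry over the input names, which only works while the inputs are simple column names.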



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)