Posted to dev@madlib.apache.org by Frank McQuillan <fm...@pivotal.io> on 2016/11/08 22:03:29 UTC
Re: Encoding categorical variables

Here is the JIRA with attached requirements doc.
https://issues.apache.org/jira/browse/MADLIB-1038

Please put your comments in the JIRA.  There are still some outstanding
questions to be puzzled out.

Frank

On Fri, Oct 28, 2016 at 3:04 PM, Frank McQuillan <fm...@pivotal.io>
wrote:

> Yes thanks Vatsan we have been looking at that.
>
> On Fri, Oct 28, 2016 at 2:39 PM, Srivatsan R <va...@gmail.com> wrote:
>
>> You guys may have already seen this, but linking just in case:
>> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
>>
>> On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>>
>> > +Vatsan for his thoughts as well!
>> >
>> > On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>> >
>> >> Also agree that double-quoted column names are not ideal.  In addition to
>> >> the net-new features described in this thread, it'd be nice to see
>> >> non-double-quoted output as default behavior in the
>> >> existing create_indicator_variables() function.
>> >>
>> >> Thanks,
>> >> Woo
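
One way to picture non-double-quoted output is a helper that turns a column/value pair into a name PostgreSQL accepts without quotes. This is only a sketch: `safe_column_name` is a hypothetical helper, not part of MADlib, and it assumes the usual PostgreSQL rules (lowercase letters, digits, and underscores, not starting with a digit, at most 63 bytes by default). It does not handle reserved keywords or two values that collapse to the same name.

```python
import re


def safe_column_name(column, value, max_len=63):
    """Build an indicator column name that needs no double quotes.

    Hypothetical helper for illustration; not a MADlib function.
    """
    raw = f"{column}_{value}".lower()
    # Collapse every run of characters outside [a-z0-9_] into one underscore.
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # An unquoted identifier cannot be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    # PostgreSQL truncates identifiers longer than 63 bytes by default.
    return name[:max_len]


print(safe_column_name("color", "Light Blue"))
print(safe_column_name("Rating", "4+ stars!"))
print(safe_column_name("2016", "Q1"))
```

A real implementation would also need a collision policy, for example appending a counter when "a b" and "a-b" both map to `a_b`.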
>> >>
>> >> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>> >>
>> >>> I like the one-hot encoded feature.  Another variant of this idea would
>> >>> be an "all other" variable (distinct from the reference class) that
>> >>> contains occurrences of the less frequent category types.  In both of these
>> >>> scenarios, the threshold for 'less frequent' could be user-supplied.
>> >>>
>> >>> Thanks,
>> >>> Woo
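
The "all other" variant can be sketched in a few lines of plain Python. The function name `one_hot_with_other`, the `min_count` threshold, and the `is_` prefix are all assumptions made for illustration; none of them come from MADlib. The sketch keeps every frequent category as its own indicator (no reference class is dropped) and would misbehave if a real category were literally named "other".

```python
from collections import Counter


def one_hot_with_other(values, min_count):
    """One-hot encode `values`, pooling rare categories into 'is_other'.

    A category is "frequent" when it appears at least `min_count` times.
    Hypothetical helper for illustration; not a MADlib function.
    """
    counts = Counter(values)
    frequent = sorted(cat for cat, c in counts.items() if c >= min_count)
    rows = []
    for v in values:
        row = {f"is_{cat}": int(v == cat) for cat in frequent}
        # Every value below the threshold lands in the shared bucket.
        row["is_other"] = int(v not in frequent)
        rows.append(row)
    return rows


colors = ["red", "red", "blue", "blue", "green", "teal"]
encoded = one_hot_with_other(colors, min_count=2)
print(encoded[0])
print(encoded[4])
```

Here "green" and "teal" each occur once, so both rows set only `is_other`.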
>> >>>
>> >>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <ra...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> An alternative to dropping is to assign the less frequent values to the
>> >>>> reference, i.e., all one-hot encoded features will be 0.
>> >>>> Also important to note: total runtime will increase with this option since
>> >>>> we'll have to compute the exact frequency distribution.
>> >>>>
>> >>>> Another suggested change is to call this function 'one_hot_encoding' since
>> >>>> that is the output here (similar to sklearn's OneHotEncoder
>> >>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>> >>>> We can keep the current name as a deprecated alias till 2.0 is released.
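
A rough sketch of the "assign to the reference" idea, in plain Python rather than SQL: only the top n values get an indicator, and every other value encodes as all zeros. `one_hot_top_n` is a hypothetical name, not the proposed MADlib function, and ties in frequency are broken by first appearance simply because that is what `Counter.most_common` does.

```python
from collections import Counter


def one_hot_top_n(values, n):
    """Encode only the n most frequent values; the rest become all zeros.

    Hypothetical helper for illustration; not a MADlib function.
    """
    # The extra pass that computes exact frequencies is the added cost
    # mentioned above.
    counts = Counter(values)
    top = [cat for cat, _ in counts.most_common(n)]
    return [{cat: int(v == cat) for cat in top} for v in values]


sizes = ["S", "M", "M", "L", "M", "S", "XL"]
encoded = one_hot_top_n(sizes, n=2)
print(encoded[0])  # an "S" row
print(encoded[3])  # an "L" row, folded into the reference
```

No rows are dropped, so the output has the same number of rows as the input; "L" and "XL" are simply indistinguishable from each other after encoding.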
>> >>>>
>> >>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>>> wrote:
>> >>>>
>> >>>> > Jarrod,
>> >>>> >
>> >>>> > Just trying to write up detailed requirements.  How would you see this one
>> >>>> > working?
>> >>>> >
>> >>>> > "2) Option to dummy code only the top n most frequently occurring values in
>> >>>> > any column"
>> >>>> >
>> >>>> > With 1 column I can picture it: you would drop the rows with the less
>> >>>> > frequently occurring values and end up with a smaller table.  But what if
>> >>>> > you are encoding multiple columns?  Would you want a per-column specification
>> >>>> > of n, i.e., top 3 values for column x, top 10 values for column y?  If you
>> >>>> > did this then your result set might include low frequency values for column
>> >>>> > x (not in top 3) because their rows are in the top 10 for column y - this
>> >>>> > might be confusing.
>> >>>> >
>> >>>> > Frank
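
The multi-column concern can be made concrete with a small sketch. It assumes one particular reading of the row-dropping rule, namely that a row is kept when at least one of its columns holds a top-n value; that rule, and the names `top_values` and `keep_rows`, are illustrative assumptions and not a stated requirement.

```python
from collections import Counter


def top_values(rows, column, n):
    """Return the set of the n most frequent values in `column`."""
    counts = Counter(row[column] for row in rows)
    return {value for value, _ in counts.most_common(n)}


def keep_rows(rows, top_n):
    """Keep a row when at least one column holds a top-n value.

    `top_n` maps column name -> n for that column (assumed rule).
    """
    keep = {col: top_values(rows, col, n) for col, n in top_n.items()}
    return [r for r in rows if any(r[c] in keep[c] for c in keep)]


rows = [
    {"x": "a", "y": "p"},
    {"x": "a", "y": "p"},
    {"x": "b", "y": "q"},
    {"x": "c", "y": "p"},
]
kept = keep_rows(rows, {"x": 1, "y": 1})
for r in kept:
    print(r)
```

The row with `x = "c"` survives even though "c" is not a top value for x, because its y value "p" is the most frequent one. That is the confusing case described above.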
>> >>>> >
>> >>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>>> > wrote:
>> >>>> >
>> >>>> >> great, thanks for the additional information
>> >>>> >>
>> >>>> >> Frank
>> >>>> >>
>> >>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>> >>>> >> wrote:
>> >>>> >>
>> >>>> >>> IMO
>> >>>> >>>
>> >>>> >>> 1) Option to define resulting column names. Please see pdltools
>> >>>> >>> implementation - the ability to pass in a function is especially useful
>> >>>> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>>> >>> 2) Option to dummy code only the top n most frequently occurring values
>> >>>> >>> in any column
>> >>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> >>>> >>> pivotcol_val2 ...) instead of values in column names + secondary mapping
>> >>>> >>> table
>> >>>> >>> 4) Option to exclude original column from results table
>> >>>> >>>
>> >>>> >>> (1) & (2) are much higher priority than (3) & (4).
>> >>>> >>>
>> >>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> Jarrod Vawdrey
>> >>>> >>> Sr. Data Scientist
>> >>>> >>> Data Science & Engineering | Pivotal
>> >>>> >>> (650) 315-8905
>> >>>> >>> https://pivotal.io/
>> >>>> >>>
>> >>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>>> >>> wrote:
>> >>>> >>>
>> >>>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
>> >>>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the order
>> >>>> >>> > of priority as you see it?
>> >>>> >>> >
>> >>>> >>> > Also it seems like some of these could be applied to the Pivot function as
>> >>>> >>> > well, e.g., UDF for column naming.
>> >>>> >>> >
>> >>>> >>> > Frank
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>> >>>> >>> > wrote:
>> >>>> >>> >
>> >>>> >>> >> Hey Frank,
>> >>>> >>> >>
>> >>>> >>> >> How are special character values handled today? It is often not ideal to
>> >>>> >>> >> end up with column names that require double quotes to call due to
>> >>>> >>> >> downstream scripts.
>> >>>> >>> >>
>> >>>> >>> >> A couple of features that would be useful
>> >>>> >>> >>
>> >>>> >>> >> * Option to define resulting column names. Please see pdltools
>> >>>> >>> >> implementation - the ability to pass in a function is especially useful
>> >>>> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>>> >>> >> * Option to dummy code only the top n most frequently occurring values in
>> >>>> >>> >> any column
>> >>>> >>> >> * Option to exclude original column from results table
>> >>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary mapping
>> >>>> >>> >> table
>> >>>> >>> >>
>> >>>> >>> >> Thank you
>> >>>> >>> >>
>> >>>> >>> >> Jarrod Vawdrey
>> >>>> >>> >> Sr. Data Scientist
>> >>>> >>> >> Data Science & Engineering | Pivotal
>> >>>> >>> >> (650) 315-8905
>> >>>> >>> >> https://pivotal.io/
>> >>>> >>> >>
>> >>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>>> >>> >> wrote:
>> >>>> >>> >>
>> >>>> >>> >>> For the module encoding categorical variables
>> >>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>>> >>> >>> does anyone have any suggestions on improvements that we could make?
>> >>>> >>> >>>
>> >>>> >>> >>> Here is a video on how encoding categorical variables works for those not
>> >>>> >>> >>> familiar with it
>> >>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ