Posted to dev@madlib.apache.org by Frank McQuillan <fm...@pivotal.io> on 2016/10/14 19:35:41 UTC

Encoding categorical variables

For the module encoding categorical variables
http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
does anyone have any suggestions on improvements that we could make?

Here is a video on how encoding categorical variables works for those not
familiar with it
https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
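
For readers who prefer code to video, the idea can be sketched in a few lines of plain Python. This is only an illustration of dummy (one-hot) encoding, not MADlib's implementation; the helper name `one_hot_encode` and the `column_value` naming scheme for the indicator columns are assumptions made for this example.

```python
def one_hot_encode(rows, column):
    """Return copies of rows with one 0/1 indicator per distinct value of column.

    Hypothetical helper for illustration only; not part of MADlib.
    """
    # Sort the distinct values so the indicator columns come out in a stable order.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = dict(row)  # keep the original columns, including the encoded one
        for value in categories:
            new_row[f"{column}_{value}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "red"},
]
for encoded_row in one_hot_encode(rows, "color"):
    print(encoded_row)
```

Each input row gains one indicator per distinct value, and exactly one of those indicators is 1.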

Re: Encoding categorical variables

Posted by Frank McQuillan <fm...@pivotal.io>.

Here is the JIRA with attached requirements doc.
https://issues.apache.org/jira/browse/MADLIB-1038

Please put your comments in the JIRA.  There are still some outstanding
questions to be puzzled out.

Frank

On Fri, Oct 28, 2016 at 3:04 PM, Frank McQuillan <fm...@pivotal.io>
wrote:

> Yes thanks Vatsan we have been looking at that.
>
> On Fri, Oct 28, 2016 at 2:39 PM, Srivatsan R <va...@gmail.com> wrote:
>
>> You guys may have already seen this, but linking just in case:
>> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
>>
>> On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>>
>> > +Vatsan for his thoughts as well!
>> >
>> > On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>> >
>> >> Also agree that double-quoted column names are not ideal.  In addition to
>> >> the net-new features described in this thread, it'd be nice to see
>> >> non-double-quoted output as default behavior in the
>> >> existing create_indicator_variables() function.
>> >>
>> >> Thanks,
>> >> Woo
>> >>
>> >> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>> >>
>> >>> I like the one-hot encoded feature.  Another variant of this idea would
>> >>> be an "all other" variable (distinct from the reference class) that
>> >>> contains occurrences of the less frequent category types.  In both of these
>> >>> scenarios, the threshold for 'less frequent' could be user-supplied.
>> >>>
>> >>> Thanks,
>> >>> Woo
>> >>>
>> >>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <ra...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> An alternative to dropping is to assign the less frequent values to the
>> >>>> reference i.e. all one-hot encoded features will be 0.
>> >>>> Also important to note: total runtime will increase with this option
>> >>>> since we'll have to compute the exact frequency distribution.
>> >>>>
>> >>>> Another suggested change is to call this function 'one_hot_encoding'
>> >>>> since that is the output here (similar to sklearn's OneHotEncoder
>> >>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>> >>>> We can keep the current name as a deprecated alias till 2.0 is released.
>> >>>>
>> >>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>>> wrote:
>> >>>>
>> >>>> > Jarrod,
>> >>>> >
>> >>>> > Just trying to write up detailed requirements.  How would you see this one
>> >>>> > working?
>> >>>> >
>> >>>> > "2) Option to dummy code only the top n most frequently occurring values in
>> >>>> > any column"
>> >>>> >
>> >>>> > With 1 column I can picture it: you would drop the rows with the less
>> >>>> > frequently occurring values and end up with a smaller table.  But what if
>> >>>> > you are encoding multiple columns?  Would you want a per-column specification
>> >>>> > of n, e.g., top 3 values for column x, top 10 values for column y?  If you
>> >>>> > did this then your result set might include low frequency values for column
>> >>>> > x (not in top 3) because they are in the top 10 for column y - this might
>> >>>> > be confusing.
>> >>>> >
>> >>>> > Frank
>> >>>> >
>> >>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>>> > wrote:
>> >>>> >
>> >>>> >> great, thanks for the additional information
>> >>>> >>
>> >>>> >> Frank
>> >>>> >>
>> >>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>> >>>> >> wrote:
>> >>>> >>
>> >>>> >>> IMO
>> >>>> >>>
>> >>>> >>> 1) Option to define resulting column names. Please see pdltools
>> >>>> >>> implementation - the ability to pass in a function is especially useful
>> >>>> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>>> >>> 2) Option to dummy code only the top n most frequently occurring values
>> >>>> >>> in any column
>> >>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1, pivotcol_val2
>> >>>> >>> ...) instead of values in column names + secondary mapping table
>> >>>> >>> 4) Option to exclude original column from results table
>> >>>> >>>
>> >>>> >>> (1) & (2) are much higher priority than (3) & (4).
>> >>>> >>>
>> >>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> Jarrod Vawdrey
>> >>>> >>> Sr. Data Scientist
>> >>>> >>> Data Science & Engineering | Pivotal
>> >>>> >>> (650) 315-8905
>> >>>> >>> https://pivotal.io/
>> >>>> >>>
>> >>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>>> >>> wrote:
>> >>>> >>>
>> >>>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
>> >>>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the order
>> >>>> >>> > of priority as you see it?
>> >>>> >>> >
>> >>>> >>> > Also it seems like some of these could be applied to the Pivot function as
>> >>>> >>> > well, e.g., UDF for column naming.
>> >>>> >>> >
>> >>>> >>> > Frank
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io>
>> >>>> >>> > wrote:
>> >>>> >>> >
>> >>>> >>> >> Hey Frank,
>> >>>> >>> >>
>> >>>> >>> >> How are special character values handled today? It is often not ideal to
>> >>>> >>> >> end up with column names that require double quotes to call due to
>> >>>> >>> >> downstream scripts.
>> >>>> >>> >>
>> >>>> >>> >> A couple of features that would be useful
>> >>>> >>> >>
>> >>>> >>> >> * Option to define resulting column names. Please see pdltools
>> >>>> >>> >> implementation - the ability to pass in a function is especially useful
>> >>>> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>>> >>> >> * Option to dummy code only the top n most frequently occurring values in
>> >>>> >>> >> any column
>> >>>> >>> >> * Option to exclude original column from results table
>> >>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary mapping
>> >>>> >>> >> table
>> >>>> >>> >>
>> >>>> >>> >> Thank you
>> >>>> >>> >>
>> >>>> >>> >> Jarrod Vawdrey
>> >>>> >>> >> Sr. Data Scientist
>> >>>> >>> >> Data Science & Engineering | Pivotal
>> >>>> >>> >> (650) 315-8905
>> >>>> >>> >> https://pivotal.io/
>> >>>> >>> >>
>> >>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> >>>> >>> >> wrote:
>> >>>> >>> >>
>> >>>> >>> >>> For the module encoding categorical variables
>> >>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>>> >>> >>> does anyone have any suggestions on improvements that we could make?
>> >>>> >>> >>>
>> >>>> >>> >>> Here is a video on how encoding categorical variables works for those not
>> >>>> >>> >>> familiar with it
>> >>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>> >>>> >>> >>>
>> >>>> >>> >>
>> >>>> >>> >>
>> >>>> >>> >
>> >>>> >>>
>> >>>> >>
>> >>>> >>
>> >>>> >
>> >>>>
>> >>>
>> >>>
>> >>
>> >
>>
>
>
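
The three treatments of less frequent values raised in the quoted thread above (drop the rows, fold them into the reference so all indicators are 0, or give them a separate "all other" indicator) can be compared with a small sketch. This is plain Python for a single column, not MADlib code; the function name `encode_top_n`, the `rare` parameter and the `all_other` column label are all hypothetical names chosen for the example.

```python
from collections import Counter


def encode_top_n(values, n, rare="other"):
    """One-hot encode only the n most frequent values of one column.

    rare="other":     rare values set a separate "all_other" indicator.
    rare="reference": rare values get all zeros (folded into the reference).
    rare="drop":      rows holding rare values are removed from the output.
    Hypothetical helper for illustration only; not MADlib's API.
    """
    if rare not in ("other", "reference", "drop"):
        raise ValueError("rare must be 'other', 'reference' or 'drop'")
    # An exact frequency count needs a full pass over the data, which is the
    # extra runtime cost mentioned in the thread.
    top = [value for value, _ in Counter(values).most_common(n)]
    columns = list(top)
    if rare == "other":
        columns.append("all_other")
    encoded = []
    for value in values:
        is_rare = value not in top
        if is_rare and rare == "drop":
            continue
        row = [1 if value == category else 0 for category in top]
        if rare == "other":
            row.append(1 if is_rare else 0)
        encoded.append(row)
    return columns, encoded


values = ["a", "a", "a", "b", "b", "c", "d"]
for mode in ("other", "reference", "drop"):
    print(mode, encode_top_n(values, 2, rare=mode))
```

Only the "drop" mode changes the number of rows. With the other two modes, each column could in principle be encoded with its own n without the columns affecting one another, which is one possible way around the multi-column question raised above.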

Re: Encoding categorical variables

Posted by Frank McQuillan <fm...@pivotal.io>.

Yes thanks Vatsan we have been looking at that.

Re: Encoding categorical variables

Posted by Srivatsan R <va...@gmail.com>.

You guys may have already seen this, but linking just in case:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung <wj...@pivotal.io> wrote:

> +Vatsan for his thoughts as well!
>
> On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>
>> Also agree that double-quoted column names are not ideal.  In addition to
>> the net-new features described in this thread, it'd be nice to see
>> non-double-quoted output as default behavior in the
>> existing create_indicator_variables() function.
>>
>> Thanks,
>> Woo
>>
>> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>>
>>> I like the one-hot encoded feature.  Another variant of this idea would
>>> be an "all other" variable (distinct from the reference class) that
>>> contains occurrences of the less frequent category types.  In both of these
>>> scenarios, the threshold for 'less frequent' could be user-supplied.
>>>
>>> Thanks,
>>> Woo
>>>
>>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <ra...@gmail.com>
>>> wrote:
>>>
>>>> An alternative to dropping is to assign the less frequent values to the
>>>> reference i.e. all one-hot encoded features will be 0.
>>>> Also important to note: total runtime will increase with this option
>>>> since
>>>> we'll have to compute the exact frequency distribution.
>>>>
>>>> Another suggested change is to call this function 'one_hot_encoding'
>>>> since that is the output here (similar to sklearn's OneHotEncoder
>>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>>>> We can keep the current name as a deprecated alias till 2.0 is released.
>>>>
>>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquillan@pivotal.io>
>>>> wrote:
>>>>
>>>> > Jarrod,
>>>> >
>>>> > Just trying to write up detailed requirements.  How would you see
>>>> this one
>>>> > working?
>>>> >
>>>> > "2) Option to dummy code only the top n most frequently occurring
>>>> values in
>>>> > any column"
>>>> >
>>>> > With 1 column I can picture it: you would drop the rows with the less
>>>> > frequently occurring values and end up with a smaller table.  But what
>>>> > if you are encoding multiple columns?  Would you want a per-column
>>>> > specification of n, i.e., top 3 values for column x, top 10 values for
>>>> > column y?  If you did this then your result set might include low
>>>> > frequency values for column x (not in top 3) because they are in the
>>>> > top 10 for column y - this might be confusing.
>>>> >
>>>> > Frank
>>>> >
>>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <
>>>> fmcquillan@pivotal.io>
>>>> > wrote:
>>>> >
>>>> >> great, thanks for the additional information
>>>> >>
>>>> >> Frank
>>>> >>
>>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawdrey@pivotal.io
>>>> >
>>>> >> wrote:
>>>> >>
>>>> >>> IMO
>>>> >>>
>>>> >>> 1) Option to define resulting column names. Please see pdltools
>>>> >>> implementation - the ability to pass in a function is especially
>>>> useful (
>>>> >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>>> >>> 2) Option to dummy code only the top n most frequently occurring
>>>> values
>>>> >>> in
>>>> >>> any column
>>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>>> >>> pivotcol_val2
>>>> >>> ...) instead of values in column names + secondary mapping table
>>>> >>> 4) Option to exclude original column from results table
>>>> >>>
>>>> >>> (1) & (2) are much higher priority than (3) & (4).
>>>> >>>
>>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> Jarrod Vawdrey
>>>> >>> Sr. Data Scientist
>>>> >>> Data Science & Engineering | Pivotal
>>>> >>> (650) 315-8905
>>>> >>> https://pivotal.io/
>>>> >>>
>>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
>>>> fmcquillan@pivotal.io>
>>>> >>> wrote:
>>>> >>>
>>>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty
>>>> useful -
>>>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in
>>>> the
>>>> >>> order
>>>> >>> > of priority as you see it?
>>>> >>> >
>>>> >>> > Also it seems like some of these could be applied to the Pivot
>>>> >>> function as
>>>> >>> > well, e.g., UDF for column naming.
>>>> >>> >
>>>> >>> > Frank
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <
>>>> jvawdrey@pivotal.io>
>>>> >>> > wrote:
>>>> >>> >
>>>> >>> >> Hey Frank,
>>>> >>> >>
>>>> >>> >> How are special character values handled today? It is often not
>>>> ideal
>>>> >>> to
>>>> >>> >> end up with column names that require double quotes to call due
>>>> to
>>>> >>> >> downstream scripts.
>>>> >>> >>
>>>> >>> >> A couple of features that would be useful
>>>> >>> >>
>>>> >>> >> * Option to define resulting column names. Please see pdltools
>>>> >>> >> implementation - the ability to pass in a function is especially
>>>> >>> useful (
>>>> >>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>>> >>> >> * Option to dummy code only the top n most frequently occurring
>>>> >>> values in
>>>> >>> >> any column
>>>> >>> >> * Option to exclude original column from results table
>>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>>> >>> mapping
>>>> >>> >> table
>>>> >>> >>
>>>> >>> >> Thank you
>>>> >>> >>
>>>> >>> >> Jarrod Vawdrey
>>>> >>> >> Sr. Data Scientist
>>>> >>> >> Data Science & Engineering | Pivotal
>>>> >>> >> (650) 315-8905
>>>> >>> >> https://pivotal.io/
>>>> >>> >>
>>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <
>>>> >>> fmcquillan@pivotal.io>
>>>> >>> >> wrote:
>>>> >>> >>
>>>> >>> >>> For the module encoding categorical variables
>>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>>> >>> >>> does anyone have any suggestions on improvements that we could
>>>> make?
>>>> >>> >>>
>>>> >>> >>> Here is a video on how encoding categorical variables works for
>>>> >>> those not
>>>> >>> >>> familiar with it
>>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>>> >>> >>>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >
>>>> >>>
>>>> >>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>

Re: Encoding categorical variables

Posted by Woo Jae Jung <wj...@pivotal.io>.

+Vatsan for his thoughts as well!

On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <wj...@pivotal.io> wrote:

> Also agree that double-quoted column names are not ideal.  In addition to
> the net-new features described in this thread, it'd be nice to see
> non-double-quoted output as default behavior in the
> existing create_indicator_variables() function.
>
> Thanks,
> Woo
>
> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>
>> I like the one-hot encoded feature.  Another variant of this idea would
>> be an "all other" variable (distinct from the reference class) that
>> contains occurrences of the less frequent category types.  In both of these
>> scenarios, the threshold for 'less frequent' could be user-supplied.
>>
>> Thanks,
>> Woo
>>
>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <ra...@gmail.com>
>> wrote:
>>
>>> An alternative to dropping is to assign the less frequent values to the
>>> reference i.e. all one-hot encoded features will be 0.
>>> Also important to note: total runtime will increase with this option
>>> since
>>> we'll have to compute the exact frequency distribution.
>>>
>>> Another suggested change is to call this function 'one_hot_encoding'
>>> since
>>> that is the output here (similar to sklearn's OneHotEncoder
>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>>> We can keep the current name as a deprecated alias till 2.0 is released.
>>>
>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquillan@pivotal.io
>>> >
>>> wrote:
>>>
>>> > Jarrod,
>>> >
>>> > Just trying to write up detailed requirements.  How would you see this
>>> one
>>> > working?
>>> >
>>> > "2) Option to dummy code only the top n most frequently occurring
>>> values in
>>> > any column"
>>> >
>>> > With 1 column I can picture it: you would drop the rows with the less
>>> > frequently occurring values and end up with a smaller table.  But what
>>> > if you are encoding multiple columns?  Would you want a per-column
>>> > specification of n, i.e., top 3 values for column x, top 10 values for
>>> > column y?  If you did this then your result set might include low
>>> > frequency values for column x (not in top 3) because they are in the
>>> > top 10 for column y - this might be confusing.
>>> >
>>> > Frank
>>> >
>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <
>>> fmcquillan@pivotal.io>
>>> > wrote:
>>> >
>>> >> great, thanks for the additional information
>>> >>
>>> >> Frank
>>> >>
>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jv...@pivotal.io>
>>> >> wrote:
>>> >>
>>> >>> IMO
>>> >>>
>>> >>> 1) Option to define resulting column names. Please see pdltools
>>> >>> implementation - the ability to pass in a function is especially
>>> useful (
>>> >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >>> 2) Option to dummy code only the top n most frequently occurring
>>> values
>>> >>> in
>>> >>> any column
>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>> >>> pivotcol_val2
>>> >>> ...) instead of values in column names + secondary mapping table
>>> >>> 4) Option to exclude original column from results table
>>> >>>
>>> >>> (1) & (2) are much higher priority than (3) & (4).
>>> >>>
>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>>> >>>
>>> >>>
>>> >>>
>>> >>> Jarrod Vawdrey
>>> >>> Sr. Data Scientist
>>> >>> Data Science & Engineering | Pivotal
>>> >>> (650) 315-8905
>>> >>> https://pivotal.io/
>>> >>>
>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
>>> fmcquillan@pivotal.io>
>>> >>> wrote:
>>> >>>
>>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty
>>> useful -
>>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in
>>> the
>>> >>> order
>>> >>> > of priority as you see it?
>>> >>> >
>>> >>> > Also it seems like some of these could be applied to the Pivot
>>> >>> function as
>>> >>> > well, e.g., UDF for column naming.
>>> >>> >
>>> >>> > Frank
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <
>>> jvawdrey@pivotal.io>
>>> >>> > wrote:
>>> >>> >
>>> >>> >> Hey Frank,
>>> >>> >>
>>> >>> >> How are special character values handled today? It is often not
>>> ideal
>>> >>> to
>>> >>> >> end up with column names that require double quotes to call due to
>>> >>> >> downstream scripts.
>>> >>> >>
>>> >>> >> A couple of features that would be useful
>>> >>> >>
>>> >>> >> * Option to define resulting column names. Please see pdltools
>>> >>> >> implementation - the ability to pass in a function is especially
>>> >>> useful (
>>> >>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >>> >> * Option to dummy code only the top n most frequently occurring
>>> >>> values in
>>> >>> >> any column
>>> >>> >> * Option to exclude original column from results table
>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>> >>> mapping
>>> >>> >> table
>>> >>> >>
>>> >>> >> Thank you
>>> >>> >>
>>> >>> >> Jarrod Vawdrey
>>> >>> >> Sr. Data Scientist
>>> >>> >> Data Science & Engineering | Pivotal
>>> >>> >> (650) 315-8905
>>> >>> >> https://pivotal.io/
>>> >>> >>
>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <
>>> >>> fmcquillan@pivotal.io>
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >>> For the module encoding categorical variables
>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>> >>> >>> does anyone have any suggestions on improvements that we could
>>> make?
>>> >>> >>>
>>> >>> >>> Here is a video on how encoding categorical variables works for
>>> >>> those not
>>> >>> >>> familiar with it
>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>> >>> >>>
>>> >>> >>
>>> >>> >>
>>> >>> >
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>

Re: Encoding categorical variables

Posted by Woo Jae Jung <wj...@pivotal.io>.

Also agree that double-quoted column names are not ideal.  In addition to
the net-new features described in this thread, it'd be nice to see
non-double-quoted output as default behavior in the
existing create_indicator_variables() function.

Thanks,
Woo
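
To illustrate the non-double-quoted naming idea, here is a minimal sketch of how a category value could be turned into a column name that PostgreSQL accepts without quotes (lowercase letters, digits and underscores, not starting with a digit, at most 63 bytes by default). The function name safe_column_name and its parameters are hypothetical; this is not how create_indicator_variables() behaves today.

```python
import re


def safe_column_name(column, value, max_len=63):
    """Build an identifier usable without double quotes (hypothetical helper)."""
    # Unquoted identifiers are folded to lowercase, so lowercase up front.
    raw = f"{column}_{value}".lower()
    # Collapse every run of characters outside [a-z0-9_] into one underscore.
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # An unquoted identifier may not be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    # Default PostgreSQL identifier limit is 63 bytes (the name is ASCII here).
    return name[:max_len]


print(safe_column_name("city", "New York"))   # city_new_york
print(safe_column_name("grade", "A+"))        # grade_a
print(safe_column_name("2016", "Q4/FY"))      # _2016_q4_fy
```

Distinct values can collapse to the same name under this scheme (for example "A+" and "A-"), so a real implementation would also need collision handling, such as a numeric suffix plus a mapping table.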

On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wj...@pivotal.io> wrote:

> I like the one-hot encoded feature.  Another variant of this idea would be
> an "all other" variable (distinct from the reference class) that contains
> occurrences of the less frequent category types.  In both of these
> scenarios, the threshold for 'less frequent' could be user-supplied.
>
> Thanks,
> Woo
>
> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <ra...@gmail.com> wrote:
>
>> An alternative to dropping is to assign the less frequent values to the
>> reference i.e. all one-hot encoded features will be 0.
>> Also important to note: total runtime will increase with this option since
>> we'll have to compute the exact frequency distribution.
>>
>> Another suggested change is to call this function 'one_hot_encoding' since
>> that is the output here (similar to sklearn's OneHotEncoder
>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>> We can keep the current name as a deprecated alias till 2.0 is released.
>>
>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fm...@pivotal.io>
>> wrote:
>>
>> > Jarrod,
>> >
>> > Just trying to write up detailed requirements.  How would you see this
>> one
>> > working?
>> >
>> > "2) Option to dummy code only the top n most frequently occurring
>> values in
>> > any column"
>> >
>> > With 1 column I can picture it: you would drop the rows with the less
>> > frequently occurring values and end up with a smaller table.  But what
>> > if you are encoding multiple columns?  Would you want a per-column
>> > specification of n, i.e., top 3 values for column x, top 10 values for
>> > column y?  If you did this then your result set might include low
>> > frequency values for column x (not in top 3) because they are in the
>> > top 10 for column y - this might be confusing.
>> >
>> > Frank
>> >
>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquillan@pivotal.io
>> >
>> > wrote:
>> >
>> >> great, thanks for the additional information
>> >>
>> >> Frank
>> >>
>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jv...@pivotal.io>
>> >> wrote:
>> >>
>> >>> IMO
>> >>>
>> >>> 1) Option to define resulting column names. Please see pdltools
>> >>> implementation - the ability to pass in a function is especially
>> useful (
>> >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> 2) Option to dummy code only the top n most frequently occurring
>> values
>> >>> in
>> >>> any column
>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> >>> pivotcol_val2
>> >>> ...) instead of values in column names + secondary mapping table
>> >>> 4) Option to exclude original column from results table
>> >>>
>> >>> (1) & (2) are much higher priority than (3) & (4).
>> >>>
>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>> >>>
>> >>>
>> >>>
>> >>> Jarrod Vawdrey
>> >>> Sr. Data Scientist
>> >>> Data Science & Engineering | Pivotal
>> >>> (650) 315-8905
>> >>> https://pivotal.io/
>> >>>
>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
>> fmcquillan@pivotal.io>
>> >>> wrote:
>> >>>
>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful
>> -
>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>> >>> order
>> >>> > of priority as you see it?
>> >>> >
>> >>> > Also it seems like some of these could be applied to the Pivot
>> >>> function as
>> >>> > well, e.g., UDF for column naming.
>> >>> >
>> >>> > Frank
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <
>> jvawdrey@pivotal.io>
>> >>> > wrote:
>> >>> >
>> >>> >> Hey Frank,
>> >>> >>
>> >>> >> How are special character values handled today? It is often not
>> ideal
>> >>> to
>> >>> >> end up with column names that require double quotes to call due to
>> >>> >> downstream scripts.
>> >>> >>
>> >>> >> A couple of features that would be useful
>> >>> >>
>> >>> >> * Option to define resulting column names. Please see pdltools
>> >>> >> implementation - the ability to pass in a function is especially
>> >>> useful (
>> >>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> >> * Option to dummy code only the top n most frequently occurring
>> >>> values in
>> >>> >> any column
>> >>> >> * Option to exclude original column from results table
>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>> >>> mapping
>> >>> >> table
>> >>> >>
>> >>> >> Thank you
>> >>> >>
>> >>> >> Jarrod Vawdrey
>> >>> >> Sr. Data Scientist
>> >>> >> Data Science & Engineering | Pivotal
>> >>> >> (650) 315-8905
>> >>> >> https://pivotal.io/
>> >>> >>
>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <
>> >>> fmcquillan@pivotal.io>
>> >>> >> wrote:
>> >>> >>
>> >>> >>> For the module encoding categorical variables
>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>> >>> does anyone have any suggestions on improvements that we could
>> make?
>> >>> >>>
>> >>> >>> Here is a video on how encoding categorical variables works for
>> >>> those not
>> >>> >>> familiar with it
>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>> >>> >>>
>> >>> >>
>> >>> >>
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
>
>

Re: Encoding categorical variables

Posted by Woo Jae Jung <wj...@pivotal.io>.

I like the one-hot encoded feature.  Another variant of this idea would be
an "all other" variable (distinct from the reference class) that contains
occurrences of the less frequent category types.  In both of these
scenarios, the threshold for 'less frequent' could be user-supplied.

Thanks,
Woo
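
The two ways of handling less frequent values discussed here (a separate "all other" indicator, or folding them into the reference class as all zeros) can be sketched as follows. This is a standard-library illustration only; encode_top_n, top_n and other_label are hypothetical names, not part of MADlib, and the sketch assumes other_label does not collide with a real category value.

```python
from collections import Counter


def encode_top_n(values, top_n, other_label=None):
    """One-hot encode only the top_n most frequent values (hypothetical helper).

    With other_label set, rarer values share one extra "all other" indicator.
    With other_label=None, rarer values get all zeros, i.e. they are folded
    into the reference class.
    """
    # Exact frequency distribution; this is the extra pass over the data.
    counts = Counter(values)
    kept = [v for v, _ in counts.most_common(top_n)]
    columns = list(kept)
    if other_label is not None:
        columns.append(other_label)
    rows = []
    for v in values:
        key = v if v in kept else other_label
        rows.append([1 if c == key else 0 for c in columns])
    return columns, rows


data = ["a", "a", "a", "b", "b", "c", "d"]

cols_other, rows_other = encode_top_n(data, top_n=2, other_label="other")
print(cols_other)       # ['a', 'b', 'other']
print(rows_other[5])    # [0, 0, 1]  ("c" goes to the "all other" column)

cols_ref, rows_ref = encode_top_n(data, top_n=2)
print(cols_ref)         # ['a', 'b']
print(rows_ref[5])      # [0, 0]     ("c" is folded into the reference class)
```

Calling such a function once per column, each with its own n, keeps every input row, because rare values are recoded rather than dropped.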

On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <ra...@gmail.com> wrote:

> An alternative to dropping is to assign the less frequent values to the
> reference i.e. all one-hot encoded features will be 0.
> Also important to note: total runtime will increase with this option since
> we'll have to compute the exact frequency distribution.
>
> Another suggested change is to call this function 'one_hot_encoding' since
> that is the output here (similar to sklearn's OneHotEncoder
> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
> We can keep the current name as a deprecated alias till 2.0 is released.
>
> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fm...@pivotal.io>
> wrote:
>
> > Jarrod,
> >
> > Just trying to write up detailed requirements.  How would you see this
> one
> > working?
> >
> > "2) Option to dummy code only the top n most frequently occurring values
> in
> > any column"
> >
> > With 1 column I can picture it: you would drop the rows with the less
> > frequently occurring values and end up with a smaller table.  But what
> > if you are encoding multiple columns?  Would you want a per-column
> > specification of n, i.e., top 3 values for column x, top 10 values for
> > column y?  If you did this then your result set might include low
> > frequency values for column x (not in top 3) because they are in the
> > top 10 for column y - this might be confusing.
> >
> > Frank
> >
> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fm...@pivotal.io>
> > wrote:
> >
> >> great, thanks for the additional information
> >>
> >> Frank
> >>
> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jv...@pivotal.io>
> >> wrote:
> >>
> >>> IMO
> >>>
> >>> 1) Option to define resulting column names. Please see pdltools
> >>> implementation - the ability to pass in a function is especially
> useful (
> >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>> 2) Option to dummy code only the top n most frequently occurring values
> >>> in
> >>> any column
> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
> >>> pivotcol_val2
> >>> ...) instead of values in column names + secondary mapping table
> >>> 4) Option to exclude original column from results table
> >>>
> >>> (1) & (2) are much higher priority than (3) & (4).
> >>>
> >>> Agreed that these could also be applied to Pivoting (especially 1).
> >>>
> >>>
> >>>
> >>> Jarrod Vawdrey
> >>> Sr. Data Scientist
> >>> Data Science & Engineering | Pivotal
> >>> (650) 315-8905
> >>> https://pivotal.io/
> >>>
> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
> fmcquillan@pivotal.io>
> >>> wrote:
> >>>
> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
> >>> order
> >>> > of priority as you see it?
> >>> >
> >>> > Also it seems like some of these could be applied to the Pivot
> >>> function as
> >>> > well, e.g., UDF for column naming.
> >>> >
> >>> > Frank
> >>> >
> >>> >
> >>> >
> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io
> >
> >>> > wrote:
> >>> >
> >>> >> Hey Frank,
> >>> >>
> >>> >> How are special character values handled today? It is often not
> ideal
> >>> to
> >>> >> end up with column names that require double quotes to call due to
> >>> >> downstream scripts.
> >>> >>
> >>> >> A couple of features that would be useful
> >>> >>
> >>> >> * Option to define resulting column names. Please see pdltools
> >>> >> implementation - the ability to pass in a function is especially
> >>> useful (
> >>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>> >> * Option to dummy code only the top n most frequently occurring
> >>> values in
> >>> >> any column
> >>> >> * Option to exclude original column from results table
> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
> >>> mapping
> >>> >> table
> >>> >>
> >>> >> Thank you
> >>> >>
> >>> >> Jarrod Vawdrey
> >>> >> Sr. Data Scientist
> >>> >> Data Science & Engineering | Pivotal
> >>> >> (650) 315-8905
> >>> >> https://pivotal.io/
> >>> >>
> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <
> >>> fmcquillan@pivotal.io>
> >>> >> wrote:
> >>> >>
> >>> >>> For the module encoding categorical variables
> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> >>> >>> does anyone have any suggestions on improvements that we could
> make?
> >>> >>>
> >>> >>> Here is a video on how encoding categorical variables works for
> >>> those not
> >>> >>> familiar with it
> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
> >>> >>>
> >>> >>
> >>> >>
> >>> >
> >>>
> >>
> >>
> >
>

Re: Encoding categorical variables

Posted by Woo Jae Jung <wj...@pivotal.io>.

I like the one-hot encoded feature.  Another variant of this idea would be
an "all other" variable (distinct from the reference class) that contains
occurrences of the less frequent category types.  In both of these
scenarios, the threshold for 'less frequent' could be user-supplied.

Thanks,
Woo
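
The "all other" idea above can be sketched in a few lines. This is only an
illustration in plain Python, not MADlib code: the function name
encode_with_other, the min_count threshold and the "other" column label are
all assumptions made for the example, and no reference class is dropped here.

```python
from collections import Counter

def encode_with_other(values, min_count):
    """One-hot encode values; categories seen fewer than min_count times
    share a single 'other' indicator (hypothetical helper)."""
    counts = Counter(values)
    # Categories that meet the user-supplied threshold keep their own column.
    frequent = sorted(v for v, c in counts.items() if c >= min_count)
    # Assumption: no real category is literally named "other".
    columns = frequent + ["other"]
    rows = []
    for v in values:
        key = v if v in frequent else "other"
        rows.append({col: int(col == key) for col in columns})
    return columns, rows

colors = ["red", "red", "blue", "blue", "green", "teal"]
columns, rows = encode_with_other(colors, min_count=2)
```

With min_count=2, "green" and "teal" both land in the shared "other"
indicator, so every row still has exactly one column set to 1.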

On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <ra...@gmail.com> wrote:

> An alternative to dropping is to assign the less frequent values to the
> reference i.e. all one-hot encoded features will be 0.
> Also important to note: total runtime will increase with this option since
> we'll have to compute the exact frequency distribution.
>
> Another suggested change is to call this function 'one_hot_encoding' since
> that is the output here (similar to sklearn's OneHotEncoder
> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
> We can keep the current name as a deprecated alias till 2.0 is released.
>
> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fm...@pivotal.io>
> wrote:
>
> > Jarrod,
> >
> > Just trying to write up detailed requirements.  How would you see this
> one
> > working?
> >
> > "2) Option to dummy code only the top n most frequently occurring values
> in
> > any column"
> >
> > With 1 column I can picture it, you would drop the rows with the less
> > frequently occurring values and end up with a smaller table.  But what if
> > you are encoding multiple rows?    Would you want a per row specification
> > of n? i.e., top 3 values for column x, top 10 values for column y?  If
> you
> > did this then your result set might include low frequency values for
> column
> > x (not in top 3) because they are in the top 10 for column y - this might
> > be confusing.
> >
> > Frank
> >
> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fm...@pivotal.io>
> > wrote:
> >
> >> great, thanks for the additional information
> >>
> >> Frank
> >>
> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jv...@pivotal.io>
> >> wrote:
> >>
> >>> IMO
> >>>
> >>> 1) Option to define resulting column names. Please see pdltools
> >>> implementation - the ability to pass in a function is especially
> useful (
> >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>> 2) Option to dummy code only the top n most frequently occurring values
> >>> in
> >>> any column
> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
> >>> pivotcol_val2
> >>> ...) instead of values in column names + secondary mapping table
> >>> 4) Option to exclude original column from results table
> >>>
> >>> (1) & (2) are much higher priority than (3) & (4).
> >>>
> >>> Agreed that these could also be applied to Pivoting (especially 1).
> >>>
> >>>
> >>>
> >>> Jarrod Vawdrey
> >>> Sr. Data Scientist
> >>> Data Science & Engineering | Pivotal
> >>> (650) 315-8905
> >>> https://pivotal.io/
> >>>
> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
> fmcquillan@pivotal.io>
> >>> wrote:
> >>>
> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
> >>> order
> >>> > of priority as you see it?
> >>> >
> >>> > Also it seems like some of these could be applied to the Pivot
> >>> function as
> >>> > well, e.g., UDF for column naming.
> >>> >
> >>> > Frank
> >>> >
> >>> >
> >>> >
> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawdrey@pivotal.io
> >
> >>> > wrote:
> >>> >
> >>> >> Hey Frank,
> >>> >>
> >>> >> How are special character values handled today? It is often not
> ideal
> >>> to
> >>> >> end up with column names that require double quotes to call due to
> >>> >> downstream scripts.
> >>> >>
> >>> >> A couple of features that would be useful
> >>> >>
> >>> >> * Option to define resulting column names. Please see pdltools
> >>> >> implementation - the ability to pass in a function is especially
> >>> useful (
> >>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>> >> * Option to dummy code only the top n most frequently occurring
> >>> values in
> >>> >> any column
> >>> >> * Option to exclude original column from results table
> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
> >>> mapping
> >>> >> table
> >>> >>
> >>> >> Thank you
> >>> >>
> >>> >> Jarrod Vawdrey
> >>> >> Sr. Data Scientist
> >>> >> Data Science & Engineering | Pivotal
> >>> >> (650) 315-8905
> >>> >> https://pivotal.io/
> >>> >>
> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <
> >>> fmcquillan@pivotal.io>
> >>> >> wrote:
> >>> >>
> >>> >>> For the module encoding categorical variables
> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> >>> >>> does anyone have any suggestions on improvements that we could
> make?
> >>> >>>
> >>> >>> Here is a video on how encoding categorical variables works for
> >>> those not
> >>> >>> familiar with it
> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
> >>> >>>
> >>> >>
> >>> >>
> >>> >
> >>>
> >>
> >>
> >
>

Re: Encoding categorical variables

Posted by Rahul Iyer <ra...@gmail.com>.

An alternative to dropping is to assign the less frequent values to the
reference, i.e., all one-hot encoded features will be 0 for those rows.
Also important to note: total runtime will increase with this option since
we'll have to compute the exact frequency distribution.
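
A minimal sketch of that alternative, in plain Python rather than MADlib SQL.
The name encode_top_n and its parameters are assumptions for illustration.
The Counter call is the extra full pass over the data that accounts for the
added runtime.

```python
from collections import Counter

def encode_top_n(values, n):
    """Keep indicator columns for the n most frequent values only;
    every other value falls into the reference (all zeros)."""
    counts = Counter(values)  # exact frequency distribution: one extra pass
    # Rank by descending count, breaking ties by value for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = [v for v, _ in ranked[:n]]
    rows = [[int(v == t) for t in top] for v in values]
    return top, rows

animals = ["cat", "dog", "cat", "emu", "dog", "cat", "yak"]
top, rows = encode_top_n(animals, n=2)
```

No rows are dropped here: "emu" and "yak" stay in the output but carry zeros
in every indicator column, which makes them indistinguishable from any other
reference-class row.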

Another suggested change is to call this function 'one_hot_encoding' since
that is the output here (similar to sklearn's OneHotEncoder
<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
We can keep the current name as a deprecated alias till 2.0 is released.
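
One way a deprecated alias could look, sketched in Python purely to show the
pattern. The two functions below are stand-ins that reuse the names from this
thread; they are not the MADlib SQL signatures, and the encoding logic is a
simplified placeholder.

```python
import warnings

def one_hot_encoding(values):
    """Placeholder encoder: one indicator per distinct value, sorted."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

def create_indicator_variables(values):
    """Old name kept as a thin alias that warns and then delegates."""
    warnings.warn(
        "create_indicator_variables is deprecated; use one_hot_encoding",
        DeprecationWarning,
        stacklevel=2,
    )
    return one_hot_encoding(values)
```

Existing callers keep working and see a warning, and the alias can be removed
in a later major release.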

On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fm...@pivotal.io>
wrote:

> Jarrod,
>
> Just trying to write up detailed requirements.  How would you see this one
> working?
>
> "2) Option to dummy code only the top n most frequently occurring values in
> any column"
>
> With 1 column I can picture it, you would drop the rows with the less
> frequently occurring values and end up with a smaller table.  But what if
> you are encoding multiple rows?    Would you want a per row specification
> of n? i.e., top 3 values for column x, top 10 values for column y?  If you
> did this then your result set might include low frequency values for column
> x (not in top 3) because they are in the top 10 for column y - this might
> be confusing.
>
> Frank
>
> On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fm...@pivotal.io>
> wrote:
>
>> great, thanks for the additional information
>>
>> Frank
>>
>> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jv...@pivotal.io>
>> wrote:
>>
>>> IMO
>>>
>>> 1) Option to define resulting column names. Please see pdltools
>>> implementation - the ability to pass in a function is especially useful (
>>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> 2) Option to dummy code only the top n most frequently occurring values
>>> in
>>> any column
>>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>> pivotcol_val2
>>> ...) instead of values in column names + secondary mapping table
>>> 4) Option to exclude original column from results table
>>>
>>> (1) & (2) are much higher priority than (3) & (4).
>>>
>>> Agreed that these could also be applied to Pivoting (especially 1).
>>>
>>>
>>>
>>> Jarrod Vawdrey
>>> Sr. Data Scientist
>>> Data Science & Engineering | Pivotal
>>> (650) 315-8905
>>> https://pivotal.io/
>>>
>>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fm...@pivotal.io>
>>> wrote:
>>>
>>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
>>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>>> order
>>> > of priority as you see it?
>>> >
>>> > Also it seems like some of these could be applied to the Pivot
>>> function as
>>> > well, e.g., UDF for column naming.
>>> >
>>> > Frank
>>> >
>>> >
>>> >
>>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jv...@pivotal.io>
>>> > wrote:
>>> >
>>> >> Hey Frank,
>>> >>
>>> >> How are special character values handled today? It is often not ideal
>>> to
>>> >> end up with column names that require double quotes to call due to
>>> >> downstream scripts.
>>> >>
>>> >> A couple of features that would be useful
>>> >>
>>> >> * Option to define resulting column names. Please see pdltools
>>> >> implementation - the ability to pass in a function is especially
>>> useful (
>>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >> * Option to dummy code only the top n most frequently occurring
>>> values in
>>> >> any column
>>> >> * Option to exclude original column from results table
>>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>> mapping
>>> >> table
>>> >>
>>> >> Thank you
>>> >>
>>> >> Jarrod Vawdrey
>>> >> Sr. Data Scientist
>>> >> Data Science & Engineering | Pivotal
>>> >> (650) 315-8905
>>> >> https://pivotal.io/
>>> >>
>>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <
>>> fmcquillan@pivotal.io>
>>> >> wrote:
>>> >>
>>> >>> For the module encoding categorical variables
>>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>> >>> does anyone have any suggestions on improvements that we could make?
>>> >>>
>>> >>> Here is a video on how encoding categorical variables works for
>>> those not
>>> >>> familiar with it
>>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>

Re: Encoding categorical variables

Posted by Frank McQuillan <fm...@pivotal.io>.

Jarrod,

Just trying to write up detailed requirements.  How would you see this one
working?

"2) Option to dummy code only the top n most frequently occurring values in
any column"

With 1 column I can picture it: you would drop the rows with the less
frequently occurring values and end up with a smaller table.  But what if
you are encoding multiple columns?  Would you want a per-column specification
of n, i.e., top 3 values for column x, top 10 values for column y?  If you
did this then your result set might include low frequency values for column
x (not in top 3) because those rows are in the top 10 for column y - this
might be confusing.

Frank
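
The multi-column case can be made concrete with a small sketch. This is plain
Python with made-up data; top_values, keep_if_any_top and the per-column dict
of n are hypothetical names, and "keep a row if any column holds a top value"
is just one possible reading of the requirement.

```python
from collections import Counter

def top_values(rows, column, n):
    """Return the set of the n most frequent values in one column."""
    counts = Counter(r[column] for r in rows)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {v for v, _ in ranked[:n]}

def keep_if_any_top(rows, top_n):
    """Keep a row when at least one column holds one of its top-n values."""
    tops = {col: top_values(rows, col, n) for col, n in top_n.items()}
    return [r for r in rows if any(r[col] in tops[col] for col in tops)]

table = [
    {"x": "a", "y": "p"},
    {"x": "a", "y": "q"},
    {"x": "b", "y": "p"},
    {"x": "c", "y": "r"},
]
# Top 1 value for column x, top 2 values for column y.
kept = keep_if_any_top(table, {"x": 1, "y": 2})
```

The row with x = "b" survives even though "b" is not the top value for x,
because its y value "p" is in the top 2 for y. Only the last row is dropped.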

On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fm...@pivotal.io>
wrote:

> great, thanks for the additional information
>
> Frank
>
> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jv...@pivotal.io>
> wrote:
>
>> IMO
>>
>> 1) Option to define resulting column names. Please see pdltools
>> implementation - the ability to pass in a function is especially useful (
>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> 2) Option to dummy code only the top n most frequently occurring values in
>> any column
>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> pivotcol_val2
>> ...) instead of values in column names + secondary mapping table
>> 4) Option to exclude original column from results table
>>
>> (1) & (2) are much higher priority than (3) & (4).
>>
>> Agreed that these could also be applied to Pivoting (especially 1).
>>
>>
>>
>> Jarrod Vawdrey
>> Sr. Data Scientist
>> Data Science & Engineering | Pivotal
>> (650) 315-8905
>> https://pivotal.io/
>>
>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fm...@pivotal.io>
>> wrote:
>>
>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>> order
>> > of priority as you see it?
>> >
>> > Also it seems like some of these could be applied to the Pivot function
>> as
>> > well, e.g., UDF for column naming.
>> >
>> > Frank
>> >
>> >
>> >
>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jv...@pivotal.io>
>> > wrote:
>> >
>> >> Hey Frank,
>> >>
>> >> How are special character values handled today? It is often not ideal
>> to
>> >> end up with column names that require double quotes to call due to
>> >> downstream scripts.
>> >>
>> >> A couple of features that would be useful
>> >>
>> >> * Option to define resulting column names. Please see pdltools
>> >> implementation - the ability to pass in a function is especially
>> useful (
>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >> * Option to dummy code only the top n most frequently occurring values
>> in
>> >> any column
>> >> * Option to exclude original column from results table
>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >> pivotcol_val2 ...) instead of values in column names + secondary
>> mapping
>> >> table
>> >>
>> >> Thank you
>> >>
>> >> Jarrod Vawdrey
>> >> Sr. Data Scientist
>> >> Data Science & Engineering | Pivotal
>> >> (650) 315-8905
>> >> https://pivotal.io/
>> >>
>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <
>> fmcquillan@pivotal.io>
>> >> wrote:
>> >>
>> >>> For the module encoding categorical variables
>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>> does anyone have any suggestions on improvements that we could make?
>> >>>
>> >>> Here is a video on how encoding categorical variables works for those
>> not
>> >>> familiar with it
>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>> >>>
>> >>
>> >>
>> >
>>
>
>

Re: Encoding categorical variables

Posted by Frank McQuillan <fm...@pivotal.io>.

great, thanks for the additional information

Frank

On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jv...@pivotal.io> wrote:

> IMO
>
> 1) Option to define resulting column names. Please see pdltools
> implementation - the ability to pass in a function is especially useful (
> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> 2) Option to dummy code only the top n most frequently occurring values in
> any column
> 3) Option to create numeric column names (E.g. pivotcol_val1, pivotcol_val2
> ...) instead of values in column names + secondary mapping table
> 4) Option to exclude original column from results table
>
> (1) & (2) are much higher priority than (3) & (4).
>
> Agreed that these could also be applied to Pivoting (especially 1).
>
>
>
> Jarrod Vawdrey
> Sr. Data Scientist
> Data Science & Engineering | Pivotal
> (650) 315-8905
> https://pivotal.io/
>
> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fm...@pivotal.io>
> wrote:
>
> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
> > would you mind taking a crack at numbering them 1,2,3... etc, in the
> order
> > of priority as you see it?
> >
> > Also it seems like some of these could be applied to the Pivot function
> as
> > well, e.g., UDF for column naming.
> >
> > Frank
> >
> >
> >
> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jv...@pivotal.io>
> > wrote:
> >
> >> Hey Frank,
> >>
> >> How are special character values handled today? It is often not ideal to
> >> end up with column names that require double quotes to call due to
> >> downstream scripts.
> >>
> >> A couple of features that would be useful
> >>
> >> * Option to define resulting column names. Please see pdltools
> >> implementation - the ability to pass in a function is especially useful
> (
> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >> * Option to dummy code only the top n most frequently occurring values
> in
> >> any column
> >> * Option to exclude original column from results table
> >> * Option to create numeric column names (E.g. pivotcol_val1,
> >> pivotcol_val2 ...) instead of values in column names + secondary mapping
> >> table
> >>
> >> Thank you
> >>
> >> Jarrod Vawdrey
> >> Sr. Data Scientist
> >> Data Science & Engineering | Pivotal
> >> (650) 315-8905
> >> https://pivotal.io/
> >>
> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquillan@pivotal.io
> >
> >> wrote:
> >>
> >>> For the module encoding categorical variables
> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> >>> does anyone have any suggestions on improvements that we could make?
> >>>
> >>> Here is a video on how encoding categorical variables works for those
> not
> >>> familiar with it
> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
> >>>
> >>
> >>
> >
>

Re: Encoding categorical variables

Posted by Jarrod Vawdrey <jv...@pivotal.io>.

IMO

1) Option to define resulting column names. Please see pdltools
implementation - the ability to pass in a function is especially useful (
http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
2) Option to dummy code only the top n most frequently occurring values in
any column
3) Option to create numeric column names (e.g., pivotcol_val1, pivotcol_val2
...) instead of values in column names, plus a secondary mapping table
4) Option to exclude original column from results table
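
A rough sketch of option (1), in plain Python rather than the pdltools or
MADlib API: the caller may pass in a naming function, and the default one
produces names that would not need double quotes. default_name, one_hot and
name_fn are hypothetical names for illustration.

```python
import re

def default_name(column, value):
    """Build a lowercase name using only letters, digits and underscores."""
    raw = f"{column}_{value}".lower()
    return re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")

def one_hot(rows, column, name_fn=default_name):
    """Encode one column; output column names come from name_fn."""
    categories = sorted({r[column] for r in rows})
    # Note: two values that sanitize to the same name would collide here.
    names = {c: name_fn(column, c) for c in categories}
    return [{names[c]: int(r[column] == c) for c in categories} for r in rows]

cities = [{"city": "New York"}, {"city": "Oslo"}]
encoded = one_hot(cities, "city")
custom = one_hot(cities, "city", name_fn=lambda col, val: f"is_{val[:2].lower()}")
```

The original column is not carried into the output here, which also happens
to match option (4).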

(1) & (2) are much higher priority than (3) & (4).

Agreed that these could also be applied to Pivoting (especially 1).
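
Option (3) might look something like the following, again as a plain Python
sketch under assumptions: encode_with_mapping is a hypothetical helper, the
numbering follows sorted order of the distinct values, and the "mapping
table" is represented as a dict.

```python
def encode_with_mapping(values, prefix):
    """Use generated names like <prefix>_val1 and return a name -> value map."""
    categories = sorted(set(values))
    mapping = {f"{prefix}_val{i}": c for i, c in enumerate(categories, start=1)}
    rows = [{name: int(v == c) for name, c in mapping.items()} for v in values]
    return rows, mapping

rows, mapping = encode_with_mapping(
    ["New York", "St. John's", "New York"], "pivotcol"
)
```

The generated names never contain special characters, and the mapping keeps
the link back to the original values for anyone reading the results.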



Jarrod Vawdrey
Sr. Data Scientist
Data Science & Engineering | Pivotal
(650) 315-8905
https://pivotal.io/

On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fm...@pivotal.io>
wrote:

> Thanks for those suggestions, Jarrod.  They all sound pretty useful -
> would you mind taking a crack at numbering them 1,2,3... etc, in the order
> of priority as you see it?
>
> Also it seems like some of these could be applied to the Pivot function as
> well, e.g., UDF for column naming.
>
> Frank
>
>
>
> On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jv...@pivotal.io>
> wrote:
>
>> Hey Frank,
>>
>> How are special character values handled today? It is often not ideal to
>> end up with column names that require double quotes to call due to
>> downstream scripts.
>>
>> A couple of features that would be useful
>>
>> * Option to define resulting column names. Please see pdltools
>> implementation - the ability to pass in a function is especially useful (
>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> * Option to dummy code only the top n most frequently occurring values in
>> any column
>> * Option to exclude original column from results table
>> * Option to create numeric column names (E.g. pivotcol_val1,
>> pivotcol_val2 ...) instead of values in column names + secondary mapping
>> table
>>
>> Thank you
>>
>> Jarrod Vawdrey
>> Sr. Data Scientist
>> Data Science & Engineering | Pivotal
>> (650) 315-8905
>> https://pivotal.io/
>>
>> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fm...@pivotal.io>
>> wrote:
>>
>>> For the module encoding categorical variables
>>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>> does anyone have any suggestions on improvements that we could make?
>>>
>>> Here is a video on how encoding categorical variables works for those not
>>> familiar with it
>>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>>
>>
>>
>

Re: Encoding categorical variables

Posted by Frank McQuillan <fm...@pivotal.io>.

Thanks for those suggestions, Jarrod.  They all sound pretty useful - would
you mind taking a crack at numbering them 1,2,3... etc, in the order of
priority as you see it?

Also it seems like some of these could be applied to the Pivot function as
well, e.g., UDF for column naming.

Frank



On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jv...@pivotal.io> wrote:

> Hey Frank,
>
> How are special character values handled today? It is often not ideal to
> end up with column names that require double quotes to call due to
> downstream scripts.
>
> A couple of features that would be useful
>
> * Option to define resulting column names. Please see pdltools
> implementation - the ability to pass in a function is especially useful (
> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> * Option to dummy code only the top n most frequently occurring values in
> any column
> * Option to exclude original column from results table
> * Option to create numeric column names (E.g. pivotcol_val1, pivotcol_val2
> ...) instead of values in column names + secondary mapping table
>
> Thank you
>
> Jarrod Vawdrey
> Sr. Data Scientist
> Data Science & Engineering | Pivotal
> (650) 315-8905
> https://pivotal.io/
>
> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fm...@pivotal.io>
> wrote:
>
>> For the module encoding categorical variables
>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> does anyone have any suggestions on improvements that we could make?
>>
>> Here is a video on how encoding categorical variables works for those not
>> familiar with it
>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>
>
>

Re: Encoding categorical variables

Posted by Jarrod Vawdrey <jv...@pivotal.io>.

Hey Frank,

How are special character values handled today? It is often not ideal to
end up with column names that have to be double-quoted, because that
complicates downstream scripts.

A couple of features that would be useful

* Option to define resulting column names. Please see pdltools
implementation - the ability to pass in a function is especially useful (
http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
* Option to dummy code only the top n most frequently occurring values in
any column
* Option to exclude original column from results table
* Option to create numeric column names (E.g. pivotcol_val1, pivotcol_val2
...) instead of values in column names + secondary mapping table
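
As a rough illustration of the first three options, here is a minimal
Python sketch. It is not the MADlib or PDLTools implementation: the names
safe_name and encode_top_n, the list-of-dicts input and the sample data are
all hypothetical. The naming function is passed in as an argument, only the
n most frequent values get an indicator column, and the original column can
be dropped.

```python
from collections import Counter
import re


def safe_name(column, value):
    """Hypothetical naming function: lowercase the value and replace each
    run of characters outside 0-9 and a-z with one underscore."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", str(value).lower()).strip("_")
    return f"{column}_{cleaned}"


def encode_top_n(rows, column, n, name_fn=safe_name, keep_original=True):
    """Dummy code only the n most frequent values of `column`.

    `rows` is a list of dicts; a new list of dicts is returned.
    """
    counts = Counter(row[column] for row in rows)
    top_values = [value for value, _ in counts.most_common(n)]
    encoded = []
    for row in rows:
        # Optionally drop the original categorical column.
        out = {k: v for k, v in row.items() if keep_original or k != column}
        for value in top_values:
            out[name_fn(column, value)] = int(row[column] == value)
        encoded.append(out)
    return encoded


rows = [
    {"id": 1, "city": "New York"},
    {"id": 2, "city": "New York"},
    {"id": 3, "city": "St. Louis"},
    {"id": 4, "city": "St. Louis"},
    {"id": 5, "city": "New York"},
    {"id": 6, "city": "Reno"},
]
result = encode_top_n(rows, "city", n=2, keep_original=False)
print(result[0])
```

Rows whose value falls outside the top n get 0 in every indicator column. A
real naming function would also need to guard against two different values
cleaning up to the same name, which this sketch does not do.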

Thank you

Jarrod Vawdrey
Sr. Data Scientist
Data Science & Engineering | Pivotal
(650) 315-8905
https://pivotal.io/

On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fm...@pivotal.io>
wrote:

> For the module encoding categorical variables
> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> does anyone have any suggestions on improvements that we could make?
>
> Here is a video on how encoding categorical variables works for those not
> familiar with it
> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>
