Posted to dev@madlib.apache.org by iyerr3 <gi...@git.apache.org> on 2016/05/23 17:53:55 UTC

[GitHub] incubator-madlib pull request: SVM: Add class weights for use with...

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/43

    SVM: Add class weights for use with unbalanced data

    JIRA: MADLIB-998
    
    Added 'class_weight' in the 'params' argument. It can either be a string or a
    dictionary-like mapping. In case of a string, we currently only accept
    'balanced' as an option. For a mapping, the user can map values of the
    dependent variable to specific double precision weights. Since SVM only
    supports binary classification at present, the class_weight mapping can
    take at most two entries.
    
    As part of this work, we add a 'tuple_weight' argument to the SVM aggregate.
    This allows future addition of sample weights (which would be multiplied with
    the class weight).
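
For readers unfamiliar with the 'balanced' option, the sketch below shows one common way such weights are computed: the scikit-learn convention of n_samples / (n_classes * class_count). It also shows how a per-row sample weight could be multiplied with the class weight, as the description suggests. The function names are hypothetical and this is plain Python for illustration, not the MADlib implementation.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class_value: weight} with weight = n_samples / (n_classes * count).

    Assumption: this follows the scikit-learn 'balanced' convention.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def tuple_weights(labels, class_weight, sample_weight=None):
    """Per-row weight: class weight times an optional per-row sample weight."""
    if sample_weight is None:
        sample_weight = [1.0] * len(labels)
    # Classes missing from the mapping fall back to a weight of 1.0 (assumption).
    return [class_weight.get(y, 1.0) * w for y, w in zip(labels, sample_weight)]


# Four rows, two classes: class 1 has three rows, class -1 has one row.
labels = [1, 1, 1, -1]
weights = balanced_class_weights(labels)
per_row = tuple_weights(labels, weights)
print(weights)
print(per_row)
```

With these weights the rarer class receives the larger weight, and the per-row weights sum to the number of rows, so the overall scale of the loss is unchanged.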

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib feature/svm_class_weights

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/43.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #43
    
----
commit 545bf14c46d1ae5e75088e83abed1c26a841d4f7
Author: Rahul Iyer <ri...@apache.org>
Date:   2016-05-20T23:28:23Z

    SVM: Add class weights for use with unbalanced data
    
    JIRA: MADLIB-998
    
    Added 'class_weight' in the 'params' argument. It can either be a string or a
    dictionary-like mapping. In case of a string, we currently only accept
    'balanced' as an option. For a mapping, the user can map values of the
    dependent variable to specific double precision weights. Since SVM only
    supports binary classification at present, the class_weight mapping can
    take at most two entries.
    
    As part of this work, we add a 'tuple_weight' argument to the SVM aggregate.
    This allows future addition of sample weights (which would be multiplied with
    the class weight).

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: SVM: Add class weights for use with...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/43#issuecomment-221105709
  
    Thanks for posting the pics.  I would suggest we create another JIRA to look at the heuristic proposed by Léon Bottou.  I'll do that.



[GitHub] incubator-madlib pull request #43: SVM: Add class weights for use with unbal...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/43



[GitHub] incubator-madlib pull request: SVM: Add class weights for use with...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/43#issuecomment-221046547
  
    Testing against scikit-learn revealed odd behavior. See figures below.
    1) With equal class sizes, the results are similar between scikit-learn and MADlib.
    2) With unequal class sizes and class_weight = balanced, the results are also close enough (dashed line). But the hyperplanes without class weights are quite different, with MADlib being obviously wrong.
    
    ![with_weights](https://cloud.githubusercontent.com/assets/2237447/15479453/18d1bfbe-20d5-11e6-9b74-05dfea88a157.png)
    ![no_weights](https://cloud.githubusercontent.com/assets/2237447/15479452/18cd7c24-20d5-11e6-95a2-00bc7d5d8ab5.png)



[GitHub] incubator-madlib pull request: SVM: Add class weights for use with...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/43#issuecomment-221083387
  
    Things are better when we don't use the default step size. 
    
    Using init_stepsize=0.1 instead of the default (0.01): 
    
    ![with_weights_higher_init](https://cloud.githubusercontent.com/assets/2237447/15483072/bfea5722-20e7-11e6-8cce-614452bcbce5.png)
    
    We need to investigate the optimal learning schedule used in scikit-learn, where the initial stepsize is determined based on a heuristic proposed by Léon Bottou.
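
As a rough illustration of the schedule mentioned above: scikit-learn documents its 'optimal' learning rate as eta = 1.0 / (alpha * (t + t0)), with t0 chosen by Bottou's heuristic. The sketch below implements only that decay formula. The heuristic itself is not reproduced; instead, t0 is derived from a chosen initial step size, which is an assumption made for illustration. The function names and the alpha value are hypothetical.

```python
def optimal_stepsize(t, alpha, t0):
    """Step size at iteration t under the schedule 1 / (alpha * (t + t0))."""
    return 1.0 / (alpha * (t + t0))


def t0_for_initial_stepsize(eta0, alpha):
    """Pick t0 so that the step size at t = 0 equals eta0 (illustrative only)."""
    return 1.0 / (eta0 * alpha)


alpha = 0.01   # regularization strength (hypothetical value)
eta0 = 0.1     # initial step size tried in the comment above
t0 = t0_for_initial_stepsize(eta0, alpha)

# Step sizes shrink monotonically from eta0 as t grows.
schedule = [optimal_stepsize(t, alpha, t0) for t in (0, 1000, 9000)]
print(schedule)
```

Under this schedule a larger initial step size corresponds to a smaller t0, which is one way to see why the choice of starting point matters for the comparison above.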



[GitHub] incubator-madlib pull request: SVM: Add class weights for use with...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/43#discussion_r64303826
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -781,6 +785,46 @@ def _random_feature_map(schema_madlib, source_table, dependent_varname,
                            dependent_varname, grouping_col))
     
     
    +def _compute_class_weight_sql(source_table, dependent_varname,
    +                              is_svc, class_weight_str):
    +    """
    +    Args:
    +        @param is_svc: Boolean, indicates if classification or regression
    +
    +    Returns:
    +        str. String when executed in SQL computes the class weight for each tuple
    +    """
    +    if not is_svc or not class_weight_str:
    +        return "1"
    +
    +    dep_to_weight = defaultdict(float)
    +    if class_weight_str == "balanced":
    +        # use half of n_samples since only doing binary classification
    +        # Change the '2' to n_classes for multinomial
    +        n_samples_per_class = num_samples(source_table) / 2
    +        bin_count = plpy.execute("""SELECT {dep} as k, count(*) as v
    +                                    FROM {src}
    +                                    GROUP BY {dep}
    +                                 """.format(dep=dependent_varname,
    +                                            src=source_table))
    +        for each_count in bin_count:
    +            plpy.info(each_count)
    --- End diff --
    
    remove the 'info' line
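
The docstring in the diff says the function returns a string that, when executed in SQL, computes the class weight for each tuple. The diff excerpt is cut off before that string is built, so the following is only a guess at the general shape: a hypothetical helper that turns a value-to-weight mapping into a SQL CASE expression. The helper name, the quoting rules, and the ELSE 1 fallback are all assumptions, not code from the pull request.

```python
def class_weight_case_sql(dependent_varname, class_weight):
    """Build a SQL CASE expression mapping each class value to its weight.

    Hypothetical helper: returns "1" when no weights are given, mirroring the
    early return in the diff above.
    """
    if not class_weight:
        return "1"
    branches = []
    for value, weight in class_weight.items():
        if isinstance(value, str):
            # Quote string labels and escape embedded single quotes.
            literal = "'" + value.replace("'", "''") + "'"
        else:
            literal = str(value)
        branches.append("WHEN {dep} = {lit} THEN {w}".format(
            dep=dependent_varname, lit=literal, w=float(weight)))
    # Values absent from the mapping get weight 1 (assumption).
    return "CASE " + " ".join(branches) + " ELSE 1 END"


sql = class_weight_case_sql("y", {1: 0.5, -1: 2})
print(sql)
```

An expression of this kind could be placed in the SELECT list that feeds the aggregate's 'tuple_weight' argument, so the weight is computed per row inside the database rather than in Python.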

