You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by iyerr3 <gi...@git.apache.org> on 2016/06/17 21:12:57 UTC

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/48

    SVM: Novelty detection using 1-class SVM

    Jira: MADLIB-990
    
    Additional author: Nandish Jayaram <nj...@pivotal.io>
    
    In this implementation of a one-class SVM, we are piggy-backing on the existing
    SVM classification. The input table to a one-class SVM does not require a
    dependent variable. A maximum-margin classifier is learned that separates all
    the data from the origin. The default kernel for one-class is Gaussian (rbf).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib feature/svm_one_class

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/48.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #48
    
----
commit c4211cef7043081852e1346218e81a78800a7428
Author: Rahul Iyer <ri...@apache.org>
Date:   2016-04-21T18:38:28Z

    SVM: Novelty detection using 1-class SVM
    
    Jira: MADLIB-990
    
    Additional author: Nandish Jayaram <nj...@pivotal.io>
    
    In this implementation of a one-class SVM, we are piggy-backing on the existing
    SVM classification. The input table to a one-class SVM does not require a
    dependent variable. A maximum-margin classifier is learned that separates all
    the data from the origin. The default kernel for one-class is Gaussian (rbf).

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r69027307
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
    --- End diff --
    
    I like that. It is cleaner for the code and easier to understand for the users. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r68995358
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
         """
         Executes the linear support vector classification algorithm.
         """
         # verbosing
    -    verbosity_level = "info" if verbose else "error"
    +    verbosity_level = "warning" if verbose else "error"
         with MinWarning(verbosity_level):
             _verify_table(source_table, model_table,
                           dependent_varname, independent_varname)
             grouping_str, grouping_col = \
                 _get_grouping_col_str(schema_madlib, source_table, grouping_col)
             kernel_func = _get_kernel_name(kernel_func)
    -        transformer = _random_feature_map(schema_madlib, source_table,
    +        transformer = _transform_w_kernel(schema_madlib, source_table,
                                               dependent_varname, independent_varname,
    -                                          kernel_func, kernel_params, grouping_col)
    +                                          kernel_func, kernel_params,
    +                                          grouping_col, add_intercept
    +                                          )
             params_dict = _extract_params(schema_madlib, params)
             args = locals()
    -        if transformer is not None:
    +        if transformer and transformer.transformed_table:
    --- End diff --
    
    There are a few places where we check against `none` for `transformer` and there are places such as in `_build_output_tables` where we assume `transformer` is always a valid object. I think it is safe to assume it is always a valid object since we enforce it in `_get_kernel_name` and `_transform_w_kernel`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #48: SVM: Novelty detection using 1-class SVM

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:

    https://github.com/apache/incubator-madlib/pull/48
  
    @mktal 
    Had to make considerable changes for two reasons: 
    1. With cross validation, the transformer was not being updated with the new CV data. That meant having two ways of outputing tables: using transformer or directly using args. 
    2. Cross validation for SVM one class had another issue: the origin in the transformed table was added before the CV, which meant some (most) of the train data did not have the origin. I changed it to do the origin insert within `svm_parsed_params` - that would ensure origin exists in every CV dataset. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r68995716
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
         """
         Executes the linear support vector classification algorithm.
         """
         # verbosing
    -    verbosity_level = "info" if verbose else "error"
    +    verbosity_level = "warning" if verbose else "error"
         with MinWarning(verbosity_level):
             _verify_table(source_table, model_table,
                           dependent_varname, independent_varname)
             grouping_str, grouping_col = \
                 _get_grouping_col_str(schema_madlib, source_table, grouping_col)
             kernel_func = _get_kernel_name(kernel_func)
    -        transformer = _random_feature_map(schema_madlib, source_table,
    +        transformer = _transform_w_kernel(schema_madlib, source_table,
                                               dependent_varname, independent_varname,
    -                                          kernel_func, kernel_params, grouping_col)
    +                                          kernel_func, kernel_params,
    +                                          grouping_col, add_intercept
    +                                          )
             params_dict = _extract_params(schema_madlib, params)
             args = locals()
    -        if transformer is not None:
    +        if transformer and transformer.transformed_table:
    --- End diff --
    
    IIRC these were here for cross validation, since the transformer was being set to None over there. I did change that to use a `no-op` transformer, so yes, we can remove the transformer check. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r68996034
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
    --- End diff --
    
    `fit_intercept` has been added to `kernel_params`. We can still keep the `array[1, ...]` usage for non-kernel case since we anyways need that in other modules. The add_intercept here is only to override the user-provided param. I'll change this to use `add_intercept=None` and add a note to make it clearer. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r69005711
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
    --- End diff --
    
    Agreed, the kernel won't add an intercept even if the independent variables has a 1, unless `fit_intercept=true`, which is the default. The difference is that `array[1, ...]` is an intercept in feature space and `fit_intercept` creates intercept in kernel space. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r68993568
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
    --- End diff --
    
    It occurs to me that `add_intercept` will always be `False` here since we are not exposing it to the users and no client code is calling this function. A relevant question to think about is whether we should add `fit_intercept` to the `params` and discourage the use of `array[..., 1]`.   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r69005568
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
    --- End diff --
    
    I agree that it is probably ok for now that we discourage the use of `array[1, ...]` only in the kernel case and ask users to specify `fit_intercept` in the kernel params. We will also need to add notes in the doc to make it clear to the users.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r68996075
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -144,30 +145,37 @@ def _verify_get_params_dict(params_dict):
     
     
     def _build_output_tables(n_iters_run, model_table, args, transformer, **kwargs):
    -    if transformer is None:
    +    # transformer should always be a valid object
    +    original_table = transformer.original_table
    +    if not original_table:
    --- End diff --
    
    I agree we should add a note to ensure that future kernels never set it to empty. It also makes sense that `original_table` should always be non-empty otherwise there is nothing to train from.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r68990844
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -144,30 +145,37 @@ def _verify_get_params_dict(params_dict):
     
     
     def _build_output_tables(n_iters_run, model_table, args, transformer, **kwargs):
    -    if transformer is None:
    +    # transformer should always be a valid object
    +    original_table = transformer.original_table
    +    if not original_table:
    --- End diff --
    
    As far as I understand, `original_table` will always be a dict object such that `not original_table` will always return `false`. Let us remove the `if` branch if that is case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r69024778
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
    --- End diff --
    
    Two good points. 
    1. I'll rename `add_intercept` to `override_fit_intercept`. That way it's clear that `fit_intercept` is being overwritten when it's true. 
    2. Now that I think about it, we don't need to tell the user to use `array[1, ...]` at all, since `fit_intercept` will take care of it both in the case of linear and non-linear SVM. The user can set `fit_intercept=False` if intercept is not desired but default is to always include intercept. 
    How does that sound? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on the issue:

    https://github.com/apache/incubator-madlib/pull/48
  
    LGTM. One suggestion though is to move `fit_intercept` from `kernel_params` to `params` since `fit_intercept` is more of a global params rather than specific to any one kernel. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r69011515
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
    --- End diff --
    
    One thing still not very clear to me. `add_intercept` being true override `fit_intercept` which is intended. But `add_intercept` being false will have no effect on `fit_intercept`. Also if user uses `array[1, ...]` and linear kernel and specify `fit_intercept=true`  in kernel params, the algorithm will essentially fit `array[1,1,...]`? I thought the behavior in the linear case is such that `fit_intercept` in the kernel params is a `no-op` while in the kernel case `fit_intercept` overrides. There are some inconsistencies due to multiple ways of specifying whether to fit intercept. I am a bit confused and need to think about it more.... 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r68994973
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -144,30 +145,37 @@ def _verify_get_params_dict(params_dict):
     
     
     def _build_output_tables(n_iters_run, model_table, args, transformer, **kwargs):
    -    if transformer is None:
    +    # transformer should always be a valid object
    +    original_table = transformer.original_table
    +    if not original_table:
    --- End diff --
    
    Do you mean `original_table` is always a non-empty dictionary? If that is a restriction, then we'll have to add a note in the `kernel_approximation` classes ensuring that future kernels never set `original_table` to empty. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r69003291
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
    --- End diff --
    
    The main issue with `array[1, ...]` is when kernel is used and kernel does not know whether to fit intercept unless we parse `array[1, ...]`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on the issue:

    https://github.com/apache/incubator-madlib/pull/48
  
    @iyerr3 Thanks for the updates.  I will review the changes and let you know. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by mktal <gi...@git.apache.org>.
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/48#discussion_r69007359
  
    --- Diff: src/ports/postgres/modules/svm/svm.py_in ---
    @@ -676,27 +985,32 @@ def svm_help(schema_madlib, message, is_svc, **kwargs):
     def svm(schema_madlib, source_table, model_table,
             dependent_varname, independent_varname, kernel_func,
             kernel_params, grouping_col, params, is_svc,
    -        verbose, detect_novelty=False, **kwargs):
    +        verbose, add_intercept=False, **kwargs):
         """
         Executes the linear support vector classification algorithm.
         """
         # verbosing
    -    verbosity_level = "info" if verbose else "error"
    +    verbosity_level = "warning" if verbose else "error"
         with MinWarning(verbosity_level):
             _verify_table(source_table, model_table,
                           dependent_varname, independent_varname)
             grouping_str, grouping_col = \
                 _get_grouping_col_str(schema_madlib, source_table, grouping_col)
             kernel_func = _get_kernel_name(kernel_func)
    -        transformer = _random_feature_map(schema_madlib, source_table,
    +        transformer = _transform_w_kernel(schema_madlib, source_table,
                                               dependent_varname, independent_varname,
    -                                          kernel_func, kernel_params, grouping_col)
    +                                          kernel_func, kernel_params,
    +                                          grouping_col, add_intercept
    +                                          )
             params_dict = _extract_params(schema_madlib, params)
             args = locals()
    -        if transformer is not None:
    +        if transformer and transformer.transformed_table:
    --- End diff --
    
    +1 on `no-op` transformer. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #48: SVM: Novelty detection using 1-class SVM

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/48


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---