Posted to commits@madlib.apache.org by fm...@apache.org on 2019/10/07 16:49:07 UTC

[madlib] branch master updated: SVM: Lower bound the default for n_components

This is an automated email from the ASF dual-hosted git repository.

fmcquillan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git


The following commit(s) were added to refs/heads/master by this push:
     new 1b5ba4a  SVM: Lower bound the default for n_components
1b5ba4a is described below

commit 1b5ba4afd58ca9b263ccac47769ef281b45e3466
Author: Orhan Kislal <ok...@apache.org>
AuthorDate: Fri Oct 4 14:45:06 2019 -0400

    SVM: Lower bound the default for n_components
    
    JIRA: MADLIB-1384
---
 src/ports/postgres/modules/svm/svm.py_in  |  2 +-
 src/ports/postgres/modules/svm/svm.sql_in | 30 ++++++++++++++----------------
 2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/src/ports/postgres/modules/svm/svm.py_in b/src/ports/postgres/modules/svm/svm.py_in
index b4f4f45..1532cb2 100644
--- a/src/ports/postgres/modules/svm/svm.py_in
+++ b/src/ports/postgres/modules/svm/svm.py_in
@@ -1330,7 +1330,7 @@ def _process_epsilon(is_svc, args):
 def _extract_kernel_params(kernel_params='', n_features=10):
     params_default = {
         # common params
-        'n_components': 2 * n_features,
+        'n_components': max(100, 2 * n_features),
         'fit_intercept': False,
         'random_state': 1,
 
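For reference, here is a minimal standalone sketch of how the new default
resolves (plain Python; the helper name default_n_components is hypothetical
and used only for illustration, since MADlib computes this inline in
_extract_kernel_params):

    # Sketch of the lower-bounded default, assuming plain Python.
    def default_n_components(n_features=10):
        # Lower-bound the default so that data sets with few features
        # still get a reasonably rich transformed feature space.
        return max(100, 2 * n_features)

    print(default_n_components(4))    # 100 (was 8 before this commit)
    print(default_n_components(200))  # 400 (unchanged by the lower bound)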
diff --git a/src/ports/postgres/modules/svm/svm.sql_in b/src/ports/postgres/modules/svm/svm.sql_in
index cb6b69e..ba05e86 100644
--- a/src/ports/postgres/modules/svm/svm.sql_in
+++ b/src/ports/postgres/modules/svm/svm.sql_in
@@ -319,23 +319,22 @@ to the end of the feature list - thus the last element of the coefficient list
 is the intercept.
 </DD>
 <DT>n_components</DT>
-<DD>Default: 2*num_features. The dimensionality of the transformed feature space.
+<DD>Default: max(100, 2*num_features). The dimensionality of the transformed feature space.
 A larger value lowers the variance of the estimate of the kernel but requires
 more memory and takes longer to train.</DD>
 @note
-Setting the \e n_components kernel parameter properly is important
-to generate an accurate decision boundary.  This parameter
-is the dimensionality of the transformed feature space that arises
-from using the primal formulation.  We use primal in MADlib
-because we are implementing in a distributed system,
-compared to an R or other single node implementations
-that can use the dual formulation.  The primal approach
-implements an approximation of the kernel function using random
-feature maps, so in the case of a gaussian kernel, the
-dimensionality of the transformed feature space is not
-infinite (as in dual), but rather of size \e n_components.
-Try increasing \e n_components higher than the default if you are
-not getting an accurate decision boundary.
+Setting the \e n_components kernel parameter properly is important to
+generate an accurate decision boundary, and can make the difference between
+a good model and a useless one. Try increasing the value of \e n_components
+if you are not getting an accurate decision boundary. This parameter arises
+from using the primal formulation, in which we map data into a relatively
+low-dimensional randomized feature space [2, 3]. The parameter
+\e n_components is the dimension of that feature space. We use the primal
+formulation in MADlib to support scaling to large data sets, in contrast to
+R and other single-node implementations that use the dual formulation and
+hence do not have this type of mapping, since the dimensionality of the
+transformed feature space in the dual is effectively infinite.
+
 <DT>random_state</DT>
 <DD>Default: 1. Seed used by a random number generator. </DD>
 </DL>
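The randomized feature map mentioned above is presumably the Rahimi-Recht
random-features construction cited as [2, 3]. A minimal NumPy sketch for the
Gaussian (RBF) kernel follows; this is a standalone illustration under that
assumption, not MADlib's internal implementation:

    # Random Fourier features for the RBF kernel (standalone sketch).
    # z(x) is constructed so that z(x) . z(y) approximates
    # exp(-gamma * ||x - y||^2).
    import numpy as np

    def random_feature_map(X, n_components=100, gamma=1.0, random_state=1):
        rng = np.random.RandomState(random_state)
        n_features = X.shape[1]
        # Frequencies drawn from the Fourier transform of the RBF kernel,
        # which is a Gaussian with variance 2 * gamma.
        W = rng.normal(scale=np.sqrt(2.0 * gamma),
                       size=(n_features, n_components))
        b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
        return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

Increasing n_components tightens this kernel approximation at the cost of
memory and training time, which is the tradeoff described in the
documentation text above.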
@@ -641,8 +640,7 @@ WHERE houses_pred.prediction != (houses.price < 100000);
 -# Train using Gaussian kernel. This time we specify
 the initial step size and maximum number of iterations to run. As part of the
 kernel parameter, we choose 10 as the dimension of the space where we train
-SVM. A larger number will lead to a more powerful model but run the risk of
-overfitting. As a result, the model will be a 10 dimensional vector, instead
+SVM. As a result, the model will be a 10-dimensional vector, instead
 of 4 as in the case of linear model.
 <pre class="example">
 DROP TABLE IF EXISTS houses_svm_gaussian, houses_svm_gaussian_summary, houses_svm_gaussian_random;