Posted to reviews@spark.apache.org by debasish83 <gi...@git.apache.org> on 2014/10/08 05:49:57 UTC

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

GitHub user debasish83 opened a pull request:

    https://github.com/apache/spark/pull/2705

    [MLLIB] [WIP] SPARK-2426: Quadratic Minimization for MLlib ALS

    ALS is a generic algorithm for matrix factorization that is equally applicable to both feature space and similarity space. Currently, ALS supports L2 regularization and a positivity constraint. This PR introduces userConstraint and productConstraint to ALS and lets the user select different constraints for the user and product solves. The supported constraints are the following (a rough sketch of the corresponding proximal/projection operators follows the list):
    
    1. SMOOTH: default ALS with L2 regularization
    2. POSITIVE: ALS with positive factors
    3. BOUNDS: ALS with factors bounded between lower and upper bounds (default between 0 and 1)
    4. SPARSE: ALS with L1 regularization
    5. EQUALITY: ALS with an equality constraint (by default the factors are positive and sum to 1)
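
    As a rough illustration only (these names are illustrative, not the PR's API), the constraints above correspond to simple proximal/projection operators; a minimal plain-Scala sketch:

    object ProximalSketch {
      // SPARSE: prox of lambda * ||x||_1 is elementwise soft-thresholding
      def proxL1(x: Array[Double], lambda: Double): Array[Double] =
        x.map(v => math.signum(v) * math.max(math.abs(v) - lambda, 0.0))

      // POSITIVE / BOUNDS: Euclidean projection onto the box [lb, ub]
      def projectBox(x: Array[Double], lb: Double, ub: Double): Array[Double] =
        x.map(v => math.min(math.max(v, lb), ub))

      // EQUALITY (default): projection onto the probability simplex
      // {x : x >= 0, sum(x) = 1}, using the standard sort-based algorithm
      def projectSimplex(x: Array[Double]): Array[Double] = {
        val u = x.sorted(Ordering[Double].reverse)
        val cssv = u.scanLeft(0.0)(_ + _).tail
        val rho = (0 until x.length).filter(i => u(i) - (cssv(i) - 1.0) / (i + 1) > 0).max
        val theta = (cssv(rho) - 1.0) / (rho + 1)
        x.map(v => math.max(v - theta, 0.0))
      }
    }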
    
    First let's focus on the problem formulation. Both the implicit and explicit feedback ALS formulations can be written as quadratic minimization problems. The quadratic objective can be written as x'Hx + c'x. Each constraint yields a subproblem of the following form (shown here for SPARSE):
    minimize x'Hx + c'x
    s.t. ||x||_1 <= s (SPARSE constraint)
    
    We rewrite the objective as f(x) = x'Hx + c'x and the constraint as an indicator function g(x).
    
    Minimization of f(x) + g(x) can then be carried out using various forward-backward splitting algorithms. We chose ADMM for the first version based on our experiments comparing against the ECOS IP solver and MOSEK; I will document the comparisons.
    
    Details of the algorithm are in the following reference:
    http://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
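
    For intuition, here is a minimal ADMM sketch in Scala/breeze (illustrative only, not the PR's QuadraticMinimizer), assuming f(x) = 0.5 * x'Hx + c'x (the 1/2 only rescales H): the x-update is a regularized linear solve and the z-update is the proximal step of g, e.g. one of the operators sketched above; the over-relaxation parameter alpha is omitted for brevity.

    import breeze.linalg.{DenseMatrix, DenseVector, norm}

    // Illustrative ADMM sketch; proxG stands in for the proximal operator of the constraint g.
    def admm(H: DenseMatrix[Double], c: DenseVector[Double],
             proxG: DenseVector[Double] => DenseVector[Double],
             rho: Double = 1.0, maxIters: Int = 400, tol: Double = 1e-6): DenseVector[Double] = {
      val n = c.length
      val M = H + DenseMatrix.eye[Double](n) * rho   // matrix of the x-update linear system
      var x = DenseVector.zeros[Double](n)
      var z = DenseVector.zeros[Double](n)
      var u = DenseVector.zeros[Double](n)
      var iter = 0
      var converged = false
      while (iter < maxIters && !converged) {
        x = M \ ((z - u) * rho - c)                  // x-update: (H + rho*I) x = rho (z - u) - c
        val zOld = z
        z = proxG(x + u)                             // z-update: proximal / projection step
        u = u + x - z                                // scaled dual update
        converged = norm(x - z) < tol && norm(z - zOld) < tol
        iter += 1
      }
      z
    }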
    
    Right now the default parameters alpha and rho are set to 1.0, but the following issues show up in experiments on the MovieLens dataset:
    1. ~3x more iterations compared to NNLS
    2. For SPARSE we hit the maximum iteration count (400) around 10% of the time
    3. For EQUALITY, rho is set to 50 based on a reference from Professor Boyd on optimal control
    
    We chose ADMM as the baseline solver, but this PR will explore the following enhancements to decrease the iteration count (a rough FISTA sketch follows the list):
    1. Accelerated ADMM using Nesterov acceleration
    2. FISTA-style forward-backward splitting
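
    For reference, a minimal FISTA sketch (illustrative only, not PR code) for min f(x) + g(x) with f(x) = 0.5 * x'Hx + c'x; the step size 1/L assumes L is an upper bound on the largest eigenvalue of H, and proxG here takes the point and the step size:

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Illustrative FISTA sketch: gradient (forward) step on f, prox (backward) step on g,
    // plus Nesterov momentum on the iterates.
    def fista(H: DenseMatrix[Double], c: DenseVector[Double], L: Double,
              proxG: (DenseVector[Double], Double) => DenseVector[Double],
              maxIters: Int = 400): DenseVector[Double] = {
      val n = c.length
      var x = DenseVector.zeros[Double](n)
      var y = x
      var t = 1.0
      for (_ <- 0 until maxIters) {
        val grad = H * y + c                          // gradient of f at y
        val xNew = proxG(y - grad / L, 1.0 / L)       // forward-backward step
        val tNew = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = xNew + (xNew - x) * ((t - 1.0) / tNew)    // Nesterov momentum
        x = xNew
        t = tNew
      }
      x
    }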
    
    In terms of use cases, the PR focuses on the following:
    
    1. Sparse matrix factorization to improve recommendations
    On MovieLens data, the RMSE with SPARSE (1.04) is currently higher than the Mahout/Spark baseline (0.9), but we have not yet looked into MAP, prec@k and ndcg@k measures. We are using the PR from @coderxiang to look into the IR measures.
    Example run:
    MASTER=spark://localhost:7077 ./bin/run-example mllib.MovieLensALS --rank 20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065 --productConstraint SPARSE --lambdaProduct 0.1 --kryo hdfs://localhost:8020/sandbox/movielens/
       
    2. Topic modeling using LSA
    References:
    2007 Sparse coding: papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.pdf
    2012 Sparse Coding + MR/MPI Microsoft: http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf
    We are implementing the 20NG (20 Newsgroups) flow to validate the improvement of sparse-coding results over LDA-based topic modeling.
    
    3. Topic modeling using PLSA
    Reference: 
    Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization
    The EQUALITY formulation with a quadratic loss is an approximation to the KL-divergence loss used in PLSA. We are interested in seeing whether it further improves the result compared to sparse coding.
    
    Next steps:
    1. Improve the convergence rate of forward-backward splitting on quadratic problems
    2. Move the test-cases to QuadraticMinimizerSuite.scala
    3. Generate results for each of the use-cases and add tests related to each use-case
    
    Related future PRs:
    1. Scale the factorization rank and remove the need to construct the H matrix
    2. Replace the quadratic loss x'Hx + c'x with a generic convex loss

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/debasish83/spark qp-als

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2705.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2705
    
----
commit 0b3f0530702b7ca54b5152a3b65530113b2d538c
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-06-13T03:24:24Z

    jecos integrated as the default qpsolve in ALS; implicit tests are failing

commit 8ba4871ed44c5e971dddd1888be34fb5f950bf76
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-06-18T04:30:12Z

    Qp options added to Spark ALS: unbounded Qp, Qp with pos, Qp with bounds, Qp with smoothness, Qp with L1

commit dd912db0966c28501948cdd27e5271ff9b10a8c5
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-06-19T08:19:06Z

    Prepared branch of ALS-QP feature/runtime testing

commit 6dd320b170ffe7d6f6487bfe686192a672fe20c9
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-06-20T23:01:21Z

    QpProblem drivers in MovieLensALS

commit 48023c84fd4896899a2d0dfd25f7c33234551b22
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-06-21T01:23:41Z

    debug option for octave quadprog validation

commit e7e64b7741c001735b85b25bca0dc6cea534bed3
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-06-21T07:50:18Z

    L1 option added to ALS; Driver added to MovieLensALS

commit 84f1d67242e4ac6a7846f60c042269e31a57894a
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-06-21T09:33:02Z

    Qp with equality and bounds added to option 4 of ECOS based QpSolver

commit 90bca10bcbf255135b4a9ec5b68b848133298dc7
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-06-30T22:59:08Z

    Movielens runtime experiments for Spark Summit talk

commit 4e2c6235b52dd81a8a2e74182b2751c389201158
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-07-15T15:28:43Z

    ADMM based QuadraticMinimizer in mllib.optimization;Used in ALS

commit 3f93ee5741be230e6a0fcab1bac9459243e9355e
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-07-16T01:50:55Z

    Refactored to use com.github.ecos package

commit f2888465677028fd1e85eb3f591b2537514529a2
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-07-18T02:43:30Z

    moved interior point based qp-als to feature/ipmqp-als branch; preparing for distributed runs; rho=50 for equality constraint, default rho=1.0, alpha = 1.0 (no over-relaxation) for convergence study

commit 21d79901ebb6e9399ebf192afda3e5cbbe782f31
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-08-02T07:15:34Z

    license cleanup; Copyright added to NOTICE

commit 13cb89b27963868227e940fd946e8faadfa32cb1
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-08-05T05:57:48Z

    Merge with HEAD

commit a12d92a3c3a950bd8782592cb8c797199aa1fdfc
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-08-07T21:36:25Z

    BSD license for Proximal algorithms

commit 02199a8939f421c10ffc688bfb0ce1e1a908e369
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-08-08T18:19:30Z

    LICENSE and NOTICE updates as per Legal

commit f43ed66127781a5e6669e059686eb4cb9c5c2e28
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-08-09T05:45:47Z

    Merge with master

commit c03dbeda9b3b1d0f182d3c1450bb1e8b3e0c9af3
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-08-13T06:42:56Z

    Merge branch 'feature/qp-als' of https://istg.vzvisp.com:8443/stash/scm/bda/spark into qp-als

commit c9d1fbf88337058f59fea958fb8f1aa17ea92c74
Author: Debasish Das <de...@one.verizon.com>
Date:   2014-10-08T03:01:36Z

    Redesign of ALS API; userConstraint and productConstraint separated;

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by debasish83 <gi...@git.apache.org>.
Github user debasish83 commented on the pull request:

    https://github.com/apache/spark/pull/2705#issuecomment-58863216
  
    @Chanda a breeze sparse matrix does not solve your problem, since breeze does not have a sparse LDL factorization; the ECOS jar, however, ships the ldl and amd native libraries, which we will use for sparse LDL...




[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by Chanda <gi...@git.apache.org>.
Github user Chanda commented on the pull request:

    https://github.com/apache/spark/pull/2705#issuecomment-58612069
  
    Related future PRs for the MLlib QP solver (QuadraticMinimizer.scala):
    3. Replace the dense Gram matrix with a sparse one (move from a jblas dense matrix to a breeze sparse matrix).
    4. Add inequality constraints to the QP; right now it only supports equality constraints.




[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by debasish83 <gi...@git.apache.org>.
Github user debasish83 commented on the pull request:

    https://github.com/apache/spark/pull/2705#issuecomment-61211629
  
    @Chanda you can see how to handle equality and bound constraints in QuadraticMinimizer.scala...
    
    The dual of a kernel SVM can be solved using this formulation as long as the rank doesn't grow beyond 4000... For larger ranks I am working on an iterative version...
    
    Here are some example runs:
    
    ./bin/spark-class org.apache.spark.mllib.optimization.QuadraticMinimizer 1000 1 1.0 0.99
    
    Inputs are as follows:
    
    rank: 1000 equality: 1 lambda: 1.0 elasticNet beta 0.99
    
    We randomly generate a 1000x1000 dense Gram matrix and 1 equality constraint of the form
    alpha1*x1 + alpha2*x2 + ... = b (i.e. a hyperplane constraint), together with randomly generated upper and lower bounds... The problem is as follows:
    
    min x'Hx + c'x
    s.t. Ax = b, lb <= x <= ub
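
    As an illustration only (the actual random problems come from QuadraticMinimizer's own driver; the names here are hypothetical), such an instance could be generated with breeze along these lines:

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Hypothetical generator for a random bound-constrained equality QP of the kind
    // described above: dense PSD gram matrix H, linear term c, one hyperplane
    // equality a'x = b, and random box bounds lb <= x <= ub.
    def randomQp(rank: Int, seed: Long = 42L) = {
      val rnd = new scala.util.Random(seed)
      val R = DenseMatrix.fill[Double](rank, rank)(rnd.nextGaussian())
      val H = R.t * R + DenseMatrix.eye[Double](rank) * 1e-3     // PSD gram matrix
      val c = DenseVector.fill[Double](rank)(rnd.nextGaussian())
      val a = DenseVector.fill[Double](rank)(rnd.nextDouble())   // alpha1, alpha2, ...
      val b = rnd.nextDouble()
      val lb = DenseVector.fill[Double](rank)(rnd.nextDouble() - 1.0)           // lower bounds
      val ub = lb + DenseVector.fill[Double](rank)(rnd.nextDouble() + 0.1)      // ub > lb
      (H, c, a, b, lb, ub)
    }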
    
    You can also generate more interesting equality constraints...
    
    lambda and beta are for the elastic net, so you need not be concerned with them...
    
    The output shows the runtime; you are most interested in the last line, QpEquality...
    
    Generating randomized QPs with rank 1000 equalities 1
    Qp Equality 2875.355 ms iterations 2237 converged true
    
    For a 1000x1000 problem it takes 3 seconds... Note that the code is not well optimized yet, but we already tie with the MOSEK runtime (MOSEK is an IP solver)... I will add more data on the MOSEK comparisons later tonight...
    
    If you are focused on solving the dual kernel SVM, you can partition your data, do a local kernel SVM solve on each worker, and follow it with an averaging step on the master...




[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by debasish83 <gi...@git.apache.org>.
Github user debasish83 closed the pull request at:

    https://github.com/apache/spark/pull/2705




[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by debasish83 <gi...@git.apache.org>.
GitHub user debasish83 reopened a pull request:

    https://github.com/apache/spark/pull/2705

    [MLLIB] [WIP] SPARK-2426: Quadratic Minimization for MLlib ALS

    ALS is a generic algorithm for matrix factorization that is equally applicable to both feature space and similarity space. Currently, ALS supports L2 regularization and a positivity constraint. This PR introduces userConstraint and productConstraint to ALS and lets the user select different constraints for the user and product solves. The supported constraints are the following:
    
    1. SMOOTH: default ALS with L2 regularization
    2. POSITIVE: ALS with positive factors
    3. BOUNDS: ALS with factors bounded between lower and upper bounds (default between 0 and 1)
    4. SPARSE: ALS with L1 regularization
    5. EQUALITY: ALS with an equality constraint (by default the factors are positive and sum to 1)
    
    First let's focus on the problem formulation. Both the implicit and explicit feedback ALS formulations can be written as quadratic minimization problems. The quadratic objective can be written as x'Hx + c'x. Each constraint yields a subproblem of the following form (shown here for SPARSE):
    minimize x'Hx + c'x
    s.t. ||x||_1 <= s (SPARSE constraint)
    
    We rewrite the objective as f(x) = x'Hx + c'x and the constraint as an indicator function g(x).
    
    Minimization of f(x) + g(x) can then be carried out using various forward-backward splitting algorithms. We chose ADMM for the first version based on our experiments comparing against the ECOS IP solver and MOSEK; I will document the comparisons.
    
    Details of the algorithm are in the following reference:
    http://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
    
    Right now the default parameters alpha and rho are set to 1.0, but the following issues show up in experiments on the MovieLens dataset:
    1. ~3x more iterations compared to NNLS
    2. For SPARSE we hit the maximum iteration count (400) around 10% of the time
    3. For EQUALITY, rho is set to 50 based on a reference from Professor Boyd on optimal control
    
    We chose ADMM as the baseline solver, but this PR will explore the following enhancements to decrease the iteration count:
    1. Accelerated ADMM using Nesterov acceleration
    2. FISTA-style forward-backward splitting
    
    In terms of use cases, the PR focuses on the following:
    
    1. Sparse matrix factorization to improve recommendations
    On MovieLens data, the RMSE with SPARSE (1.04) is currently higher than the Mahout/Spark baseline (0.9), but we have not yet looked into MAP, prec@k and ndcg@k measures. We are using the PR from @coderxiang to look into the IR measures.
    Example run:
    MASTER=spark://localhost:7077 ./bin/run-example mllib.MovieLensALS --rank 20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065 --productConstraint SPARSE --lambdaProduct 0.1 --kryo hdfs://localhost:8020/sandbox/movielens/
       
    2. Topic modeling using LSA
    References:
    2007 Sparse coding: papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.pdf
    2011 Sparse Latent Semantic Analysis (LSA; some of it is implemented in GraphLab):
    https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
    2012 Sparse Coding + MR/MPI Microsoft: http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf
    We are implementing the 20NG (20 Newsgroups) flow to validate the improvement of sparse-coding results over LDA-based topic modeling.
    
    3. Topic modeling using PLSA
    Reference: 
    Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization
    The EQUALITY formulation with a quadratic loss is an approximation to the KL-divergence loss used in PLSA. We are interested in seeing whether it further improves the result compared to sparse coding.
    
    Next steps:
    1. Improve the convergence rate of forward-backward splitting on quadratic problems
    2. Move the test-cases to QuadraticMinimizerSuite.scala
    3. Generate results for each of the use-cases and add tests related to each use-case
    
    Related future PRs:
    1. Scale the factorization rank and remove the need to construct the H matrix
    2. Replace the quadratic loss x'Hx + c'x with a generic convex loss

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/debasish83/spark qp-als

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2705.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2705
    
----
commit 9c439d33160ef3b31173381735dfa8cfb7d552ba
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-10-09T05:35:14Z

    [SPARK-3856][MLLIB] use norm operator after breeze 0.10 upgrade
    
    Got warning msg:
    
    ~~~
    [warn] /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala:50: method norm in trait NumericOps is deprecated: Use norm(XXX) instead of XXX.norm
    [warn]     var norm = vector.toBreeze.norm(p)
    ~~~
    
    dbtsai
    
    Author: Xiangrui Meng <me...@databricks.com>
    
    Closes #2718 from mengxr/SPARK-3856 and squashes the following commits:
    
    4f38169 [Xiangrui Meng] use norm operator

commit b9df8af62e8d7b263a668dfb6e9668ab4294ea37
Author: Anand Avati <av...@redhat.com>
Date:   2014-10-09T06:45:17Z

    [SPARK-2805] Upgrade to akka 2.3.4
    
    Upgrade to akka 2.3.4
    
    Author: Anand Avati <av...@redhat.com>
    
    Closes #1685 from avati/SPARK-1812-akka-2.3 and squashes the following commits:
    
    57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
    2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4

commit 86b392942daf61fed2ff7490178b128107a0e856
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-10-09T07:00:24Z

    [SPARK-3844][UI] Truncate appName in WebUI if it is too long
    
    Truncate appName in WebUI if it is too long.
    
    Author: Xiangrui Meng <me...@databricks.com>
    
    Closes #2707 from mengxr/truncate-app-name and squashes the following commits:
    
    87834ce [Xiangrui Meng] move scala import below java
    c7111dc [Xiangrui Meng] truncate appName in WebUI if it is too long

commit 13cab5ba44e2f8d2d2204b3b0d39d7c23a819bdb
Author: nartz <na...@gmail.com>
Date:   2014-10-09T07:02:11Z

    add spark.driver.memory to config docs
    
    It took me a minute to track this down, so I thought it could be useful to have it in the docs.
    
    I'm unsure if 512mb is the default for spark.driver.memory? Also - there could be a better value for the 'description' to differentiate it from spark.executor.memory.
    
    Author: nartz <na...@gmail.com>
    Author: Nathan Artz <na...@Nathans-MacBook-Pro.local>
    
    Closes #2410 from nartz/docs/add-spark-driver-memory-to-config-docs and squashes the following commits:
    
    a2f6c62 [nartz] Update configuration.md
    74521b8 [Nathan Artz] add spark.driver.memory to config docs

commit 14f222f7f76cc93633aae27a94c0e556e289ec56
Author: Qiping Li <li...@gmail.com>
Date:   2014-10-09T08:36:58Z

    [SPARK-3158][MLLIB]Avoid 1 extra aggregation for DecisionTree training
    
    Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes).
    
    ### Implementation Details
    
    Each node now has a `impurity` field and the `predict` is changed from type `Double` to type `Predict`(this can be used to compute predict probability in the future) When compute best splits for each node, we also compute impurity and predict for the child nodes, which is used to constructed newly allocated child nodes. So at level L, we have set impurity and predict for nodes at level L +1.
    If level L+1 is the last level, then we can avoid aggregation. What's more, calculation of parent impurity in
    
    Top nodes for each tree needs to be treated differently because we have to compute impurity and predict for them first. In `binsToBestSplit`, if current node is top node(level == 0), we calculate impurity and predict first.
    after finding best split, top node's predict and impurity is set to the calculated value. Non-top nodes's impurity and predict are already calculated and don't need to be recalculated again. I have considered to add a initialization step to set top nodes' impurity and predict and then we can treat all nodes in the same way, but this will need a lot of duplication of code(all the code to do seq operation(BinSeqOp) needs to be duplicated), so I choose the current way.
    
     CC mengxr manishamde jkbradley, please help me review this, thanks.
    
    Author: Qiping Li <li...@gmail.com>
    
    Closes #2708 from chouqin/avoid-agg and squashes the following commits:
    
    8e269ea [Qiping Li] adjust code and comments
    eefeef1 [Qiping Li] adjust comments and check child nodes' impurity
    c41b1b6 [Qiping Li] fix pyspark unit test
    7ad7a71 [Qiping Li] fix unit test
    822c912 [Qiping Li] add comments and unit test
    e41d715 [Qiping Li] fix bug in test suite
    6cc0333 [Qiping Li] SPARK-3158: Avoid 1 extra aggregation for DecisionTree training

commit 1e0aa4deba65aa1241b9a30edb82665eae27242f
Author: GuoQiang Li <wi...@qq.com>
Date:   2014-10-09T16:22:32Z

    [Minor] use norm operator after breeze 0.10 upgrade
    
    cc mengxr
    
    Author: GuoQiang Li <wi...@qq.com>
    
    Closes #2730 from witgo/SPARK-3856 and squashes the following commits:
    
    2cffce1 [GuoQiang Li] use norm operator after breeze 0.10 upgrade

commit 73bf3f2e0c03216aa29c25fea2d97205b5977903
Author: zsxwing <zs...@gmail.com>
Date:   2014-10-09T18:27:21Z

    [SPARK-3741] Make ConnectionManager propagate errors properly and add mo...
    
    ...re logs to avoid Executors swallowing errors
    
    This PR made the following changes:
    * Register a callback to `Connection` so that the error will be propagated properly.
    * Add more logs so that the errors won't be swallowed by Executors.
    * Use trySuccess/tryFailure because `Promise` doesn't allow to call success/failure more than once.
    
    Author: zsxwing <zs...@gmail.com>
    
    Closes #2593 from zsxwing/SPARK-3741 and squashes the following commits:
    
    1d5aed5 [zsxwing] Fix naming
    0b8a61c [zsxwing] Merge branch 'master' into SPARK-3741
    764aec5 [zsxwing] [SPARK-3741] Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors

commit b77a02f41c60d869f48b65e72ed696c05b30bc48
Author: Vida Ha <vi...@databricks.com>
Date:   2014-10-09T20:13:31Z

    [SPARK-3752][SQL]: Add tests for different UDF's
    
    Author: Vida Ha <vi...@databricks.com>
    
    Closes #2621 from vidaha/vida/SPARK-3752 and squashes the following commits:
    
    d7fdbbc [Vida Ha] Add tests for different UDF's

commit 752e90f15e0bb82d283f05eff08df874b48caed9
Author: Yash Datta <ya...@guavus.com>
Date:   2014-10-09T19:59:14Z

    [SPARK-3711][SQL] Optimize where in clause filter queries
    
    The In case class is replaced by a InSet class in case all the filters are literals, which uses a hashset instead of Sequence, thereby giving significant performance improvement (earlier the seq was using a worst case linear match (exists method) since expressions were assumed in the filter list) . Maximum improvement should be visible in case small percentage of large data matches the filter list.
    
    Author: Yash Datta <Ya...@guavus.com>
    
    Closes #2561 from saucam/branch-1.1 and squashes the following commits:
    
    4bf2d19 [Yash Datta] SPARK-3711: 1. Fix code style and import order             2. Fix optimization condition             3. Add tests for null in filter list             4. Add test case that optimization is not triggered in case of attributes in filter list
    afedbcd [Yash Datta] SPARK-3711: 1. Add test cases for InSet class in ExpressionEvaluationSuite             2. Add class OptimizedInSuite on the lines of ConstantFoldingSuite, for the optimized In clause
    0fc902f [Yash Datta] SPARK-3711: UnaryMinus will be handled by constantFolding
    bd84c67 [Yash Datta] SPARK-3711: Incorporate review comments. Move optimization of In clause to Optimizer.scala by adding a rule. Add appropriate comments
    430f5d1 [Yash Datta] SPARK-3711: Optimize the filter list in case of negative values as well
    bee98aa [Yash Datta] SPARK-3711: Optimize where in clause filter queries

commit 2c8851343a2e4d1d5b3a2b959eaa651a92982a72
Author: scwf <wa...@huawei.com>
Date:   2014-10-09T20:22:36Z

    [SPARK-3806][SQL] Minor fix for CliSuite
    
    To fix two issues in CliSuite
    1 CliSuite throw IndexOutOfBoundsException:
    Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
    	at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
    	at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
    	at org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
    	at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
    	at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
    	at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
    	at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
    	at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
    	at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
    	at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
    	at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
    	at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
    	at scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
    	at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)
    
    Actually, it is the Mutil-Threads lead to this problem.
    
    2 Using ```line.startsWith``` instead ```line.contains``` to assert expected answer. This is a tiny bug in CliSuite, for test case "Simple commands", there is a expected answers "5", if we use ```contains``` that means output like "14/10/06 11:```5```4:36 INFO CliDriver: Time taken: 1.078 seconds" or "14/10/06 11:54:36 INFO StatsReportListener: 	0%	```5```%	10%	25%	50%	75%	90%	95%	100%" will make the assert true.
    
    Author: scwf <wa...@huawei.com>
    
    Closes #2666 from scwf/clisuite and squashes the following commits:
    
    11430db [scwf] fix-clisuite

commit e7edb723d22869f228b838fd242bf8e6fe73ee19
Author: cocoatomo <co...@gmail.com>
Date:   2014-10-09T20:46:26Z

    [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log
    
    ./python/run-tests script display messages about which test it is running currently on stdout but not write them on unit-tests.log.
    It is harder for us to recognize what test programs were executed and which test was failed.
    
    Author: cocoatomo <co...@gmail.com>
    
    Closes #2724 from cocoatomo/issues/3868-display-testing-module-name and squashes the following commits:
    
    c63d9fa [cocoatomo] [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log

commit ec4d40e48186af18e25517e0474020720645f583
Author: Mike Timper <mi...@aurorafeint.com>
Date:   2014-10-09T21:02:27Z

    [SPARK-3853][SQL] JSON Schema support for Timestamp fields
    
    In JSONRDD.scala, add 'case TimestampType' in the enforceCorrectType function and a toTimestamp function.
    
    Author: Mike Timper <mi...@aurorafeint.com>
    
    Closes #2720 from mtimper/master and squashes the following commits:
    
    9386ab8 [Mike Timper] Fix and tests for SPARK-3853

commit 1faa1135a3fc0acd89f934f01a4a2edefcb93d33
Author: Patrick Wendell <pw...@gmail.com>
Date:   2014-10-09T21:50:36Z

    Revert "[SPARK-2805] Upgrade to akka 2.3.4"
    
    This reverts commit b9df8af62e8d7b263a668dfb6e9668ab4294ea37.

commit 1c7f0ab302de9f82b1bd6da852d133823bc67c66
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-10-09T21:57:27Z

    [SPARK-3339][SQL] Support for skipping json lines that fail to parse
    
    This PR aims to provide a way to skip/query corrupt JSON records. To do so, we introduce an internal column to hold corrupt records (the default name is `_corrupt_record`. This name can be changed by setting the value of `spark.sql.columnNameOfCorruptRecord`). When there is a parsing error, we will put the corrupt record in its unparsed format to the internal column. Users can skip/query this column through SQL.
    
    * To query those corrupt records
    ```
    -- For Hive parser
    SELECT `_corrupt_record`
    FROM jsonTable
    WHERE `_corrupt_record` IS NOT NULL
    -- For our SQL parser
    SELECT _corrupt_record
    FROM jsonTable
    WHERE _corrupt_record IS NOT NULL
    ```
    * To skip corrupt records and query regular records
    ```
    -- For Hive parser
    SELECT field1, field2
    FROM jsonTable
    WHERE `_corrupt_record` IS NULL
    -- For our SQL parser
    SELECT field1, field2
    FROM jsonTable
    WHERE _corrupt_record IS NULL
    ```
    
    Generally, it is not recommended to change the name of the internal column. If the name has to be changed to avoid possible name conflicts, you can use `sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)` or `sqlContext.sql(SET spark.sql.columnNameOfCorruptRecord=<new column name>)`.
    
    Author: Yin Huai <hu...@cse.ohio-state.edu>
    
    Closes #2680 from yhuai/corruptJsonRecord and squashes the following commits:
    
    4c9828e [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
    309616a [Yin Huai] Change the default name of corrupt record to "_corrupt_record".
    b4a3632 [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
    9375ae9 [Yin Huai] Set the column name of corrupt json record back to the default one after the unit test.
    ee584c0 [Yin Huai] Provide a way to query corrupt json records as unparsed strings.

commit 0c0e09f567deb775ee378f5385a16884f68b332d
Author: Daoyuan Wang <da...@intel.com>
Date:   2014-10-09T21:59:03Z

    [SPARK-3412][SQL]add missing row api
    
    chenghao-intel assigned this to me, check PR #2284 for previous discussion
    
    Author: Daoyuan Wang <da...@intel.com>
    
    Closes #2529 from adrian-wang/rowapi and squashes the following commits:
    
    c6594b2 [Daoyuan Wang] using boxed
    7b7e6e3 [Daoyuan Wang] update pattern match
    7a39456 [Daoyuan Wang] rename file and refresh getAs[T]
    4c18c29 [Daoyuan Wang] remove setAs[T] and null judge
    1614493 [Daoyuan Wang] add missing row api

commit bc3b6cb06153d6b05f311dd78459768b6cf6a404
Author: Nathan Howell <nh...@godaddy.com>
Date:   2014-10-09T22:03:01Z

    [SPARK-3858][SQL] Pass the generator alias into logical plan node
    
    The alias parameter is being ignored, which makes it more difficult to specify a qualifier for Generator expressions.
    
    Author: Nathan Howell <nh...@godaddy.com>
    
    Closes #2721 from NathanHowell/SPARK-3858 and squashes the following commits:
    
    8aa0f43 [Nathan Howell] [SPARK-3858][SQL] Pass the generator alias into logical plan node

commit ac302052870a650d56f2d3131c27755bb2960ad7
Author: ravipesala <ra...@huawei.com>
Date:   2014-10-09T22:14:58Z

    [SPARK-3813][SQL] Support "case when" conditional functions in Spark SQL.
    
    "case when" conditional function is already supported in Spark SQL but there is no support in SqlParser. So added parser support to it.
    
    Author : ravipesala ravindra.pesalahuawei.com
    
    Author: ravipesala <ra...@huawei.com>
    
    Closes #2678 from ravipesala/SPARK-3813 and squashes the following commits:
    
    70c75a7 [ravipesala] Fixed styles
    713ea84 [ravipesala] Updated as per admin comments
    709684f [ravipesala] Changed parser to support case when function.

commit 4e9b551a0b807f5a2cc6679165c8be4e88a3d077
Author: Josh Rosen <jo...@apache.org>
Date:   2014-10-09T23:08:07Z

    [SPARK-3772] Allow `ipython` to be used by Pyspark workers; IPython support improvements:
    
    This pull request addresses a few issues related to PySpark's IPython support:
    
    - Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
    - Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
    - Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use `ipython` while the workers use a different Python version.
    - Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
    - Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).
    
    There are more details in a block comment in `bin/pyspark`.
    
    Author: Josh Rosen <jo...@apache.org>
    
    Closes #2651 from JoshRosen/SPARK-3772 and squashes the following commits:
    
    7b8eb86 [Josh Rosen] More changes to PySpark python executable configuration:
    c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:

commit 2837bf8548db7e9d43f6eefedf5a73feb22daedb
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-10-10T00:54:02Z

    [SPARK-3798][SQL] Store the output of a generator in a val
    
    This prevents it from changing during serialization, leading to corrupted results.
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #2656 from marmbrus/generateBug and squashes the following commits:
    
    efa32eb [Michael Armbrust] Store the output of a generator in a val. This prevents it from changing during serialization.

commit 363baacaded56047bcc63276d729ab911e0336cf
Author: Sean Owen <so...@cloudera.com>
Date:   2014-10-10T01:21:59Z

    SPARK-3811 [CORE] More robust / standard Utils.deleteRecursively, Utils.createTempDir
    
    I noticed a few issues with how temp directories are created and deleted:
    
    *Minor*
    
    * Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in many tests to make a temp dir, but `Utils.createTempDir()` seems to be the standard Spark mechanism
    * Call to `File.deleteOnExit()` could be pushed into `Utils.createTempDir()` as well, along with this replacement
    * _I messed up the message in an exception in `Utils` in SPARK-3794; fixed here_
    
    *Bit Less Minor*
    
    * `Utils.deleteRecursively()` fails immediately if any `IOException` occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and throw one of the possibly several exceptions that occur at the end.
    * `Utils.createTempDir()` will add a JVM shutdown hook every time the method is called. Even if the subdir is the parent of another parent dir, since this check is inside the hook. However `Utils` manages a set of all dirs to delete on shutdown already, called `shutdownDeletePaths`. A single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in `TachyonBlockManager`.
    
    I noticed a few other things that might be changed but wanted to ask first:
    
    * Shouldn't the set of dirs to delete be `File`, not just `String` paths?
    * `Utils` manages the set of `TachyonFile` that have been registered for deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should this logic not live together, and not in `Utils`? it's more specific to Tachyon, and looks a slight bit odd to import in such a generic place.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #2670 from srowen/SPARK-3811 and squashes the following commits:
    
    071ae60 [Sean Owen] Update per @vanzin's review
    da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths even when an exception occurs; use one shutdown hook instead of one per method call to delete temp dirs
    3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of Files.createTempDir

commit edf02da389f75df5a42465d41f035d6b65599848
Author: Cheng Lian <li...@gmail.com>
Date:   2014-10-10T01:25:06Z

    [SPARK-3654][SQL] Unifies SQL and HiveQL parsers
    
    This PR is a follow up of #2590, and tries to introduce a top level SQL parser entry point for all SQL dialects supported by Spark SQL.
    
    A top level parser `SparkSQLParser` is introduced to handle the syntaxes that all SQL dialects should recognize (e.g. `CACHE TABLE`, `UNCACHE TABLE` and `SET`, etc.). For all the syntaxes this parser doesn't recognize directly, it fallbacks to a specified function that tries to parse arbitrary input to a `LogicalPlan`. This function is typically another parser combinator like `SqlParser`. DDL syntaxes introduced in #2475 can be moved to here.
    
    The `ExtendedHiveQlParser` now only handle Hive specific extensions.
    
    Also took the chance to refactor/reformat `SqlParser` for better readability.
    
    Author: Cheng Lian <li...@gmail.com>
    
    Closes #2698 from liancheng/gen-sql-parser and squashes the following commits:
    
    ceada76 [Cheng Lian] Minor styling fixes
    9738934 [Cheng Lian] Minor refactoring, removes optional trailing ";" in the parser
    bb2ab12 [Cheng Lian] SET property value can be empty string
    ce8860b [Cheng Lian] Passes test suites
    e86968e [Cheng Lian] Removes debugging code
    8bcace5 [Cheng Lian] Replaces digit.+ to rep1(digit) (Scala style checking doesn't like it)
    d15d54f [Cheng Lian] Unifies SQL and HiveQL parsers

commit 421382d0e728940caa3e61bc11237c61f256378a
Author: Cheng Lian <li...@gmail.com>
Date:   2014-10-10T01:26:43Z

    [SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK
    
    Using `MEMORY_AND_DISK` as default storage level for in-memory table caching. Due to the in-memory columnar representation, recomputing an in-memory cached table partitions can be very expensive.
    
    Author: Cheng Lian <li...@gmail.com>
    
    Closes #2686 from liancheng/spark-3824 and squashes the following commits:
    
    35d2ed0 [Cheng Lian] Removes extra space
    1ab7967 [Cheng Lian] Reduces test data size to fit DiskStore.getBytes()
    ba565f0 [Cheng Lian] Maks CachedBatch serializable
    07f0204 [Cheng Lian] Sets in-memory table default storage level to MEMORY_AND_DISK

commit 6f98902a3d7749e543bc493a8c62b1e3a7b924cc
Author: ravipesala <ra...@huawei.com>
Date:   2014-10-10T01:41:36Z

    [SPARK-3834][SQL] Backticks not correctly handled in subquery aliases
    
    The queries like SELECT a.key FROM (SELECT key FROM src) \`a\` does not work as backticks in subquery aliases are not handled properly. This PR fixes that.
    
    Author : ravipesala ravindra.pesalahuawei.com
    
    Author: ravipesala <ra...@huawei.com>
    
    Closes #2737 from ravipesala/SPARK-3834 and squashes the following commits:
    
    0e0ab98 [ravipesala] Fixing issue in backtick handling for subquery aliases

commit 411cf29fff011561f0093bb6101af87842828369
Author: Anand Avati <av...@redhat.com>
Date:   2014-10-10T07:46:56Z

    [SPARK-2805] Upgrade Akka to 2.3.4
    
    This is a second rev of the Akka upgrade (earlier merged, but reverted). I made a slight modification which is that I also upgrade Hive to deal with a compatibility issue related to the protocol buffers library.
    
    Author: Anand Avati <av...@redhat.com>
    Author: Patrick Wendell <pw...@gmail.com>
    
    Closes #2752 from pwendell/akka-upgrade and squashes the following commits:
    
    4c7ca3f [Patrick Wendell] Upgrading to new hive->protobuf version
    57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
    2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4

commit 90f73fcc47c7bf881f808653d46a9936f37c3c31
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-10-10T08:44:36Z

    [SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in ConnectionManager
    
    In general, individual shuffle blocks are frequently small, so mmapping them often creates a lot of waste. It may not be bad to mmap the larger ones, but it is pretty inconvenient to get configuration into ManagedBuffer, and besides it is unlikely to help all that much.
    
    Author: Aaron Davidson <aa...@databricks.com>
    
    Closes #2742 from aarondav/mmap and squashes the following commits:
    
    a152065 [Aaron Davidson] Add other pathway back
    52b6cd2 [Aaron Davidson] [SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in ConnectionManager

commit 72f36ee571ad27c7c7c70bb9aecc7e6ef51dfd44
Author: Davies Liu <da...@gmail.com>
Date:   2014-10-10T21:14:05Z

    [SPARK-3886] [PySpark] use AutoBatchedSerializer by default
    
    Use AutoBatchedSerializer by default, which will choose the proper batch size based on size of serialized objects, let the size of serialized batch fall in into  [64k - 640k].
    
    In JVM, the serializer will also track the objects in batch to figure out duplicated objects, larger batch may cause OOM in JVM.
    
    Author: Davies Liu <da...@gmail.com>
    
    Closes #2740 from davies/batchsize and squashes the following commits:
    
    52cdb88 [Davies Liu] update docs
    185f2b9 [Davies Liu] use AutoBatchedSerializer by default

commit 1d72a30874a88bdbab75217f001cf2af409016e7
Author: Patrick Wendell <pw...@gmail.com>
Date:   2014-10-10T23:49:19Z

    HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
    
    We had to upgrade our Hive 0.12 version as well to deal with a protobuf
    conflict (both hive and akka have been using a shaded protobuf version).
    This is testing a correctly patched version of Hive 0.12.
    
    Author: Patrick Wendell <pw...@gmail.com>
    
    Closes #2756 from pwendell/hotfix and squashes the following commits:
    
    cc979d0 [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.

commit 0e8203f4fb721158fb27897680da476174d24c4b
Author: Prashant Sharma <pr...@imaginea.com>
Date:   2014-10-11T01:39:55Z

    [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s).
    
    ...riden alternatives, can have default argument.
    
    Author: Prashant Sharma <pr...@imaginea.com>
    
    Closes #2750 from ScrapCodes/SPARK-2924/default-args-removed and squashes the following commits:
    
    d9785c3 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one function/ctor amongst overriden alternatives, can have default argument.

commit 81015a2ba49583d730ce65b2262f50f1f2451a79
Author: cocoatomo <co...@gmail.com>
Date:   2014-10-11T18:26:17Z

    [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
    
    ./python/run-tests search a Python 2.6 executable on PATH and use it if available.
    When using Python 2.6, it is going to import unittest2 module which is not a standard library in Python 2.6, so it fails with ImportError.
    
    Author: cocoatomo <co...@gmail.com>
    
    Closes #2759 from cocoatomo/issues/3867-unittest2-import-error and squashes the following commits:
    
    f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed

commit 7a3f589ef86200f99624fea8322e5af0cad774a7
Author: cocoatomo <co...@gmail.com>
Date:   2014-10-11T18:51:59Z

    [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
    
    Sphinx documents contains a corrupted ReST format and have some warnings.
    
    The purpose of this issue is same as https://issues.apache.org/jira/browse/SPARK-3773.
    
    commit: 0e8203f4fb721158fb27897680da476174d24c4b
    
    output
    ```
    $ cd ./python/docs
    $ make clean html
    rm -rf _build/*
    sphinx-build -b html -d _build/doctrees   . _build/html
    Making output directory...
    Running Sphinx v1.2.3
    loading pickled environment... not yet created
    building [html]: targets for 4 source files that are out of date
    updating environment: 4 added, 0 changed, 0 removed
    reading sources... [100%] pyspark.sql
    /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends without a blank line; unexpected unindent.
    /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends without a blank line; unexpected unindent.
    /Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected unindent.
    looking for now-outdated files... none found
    pickling environment... done
    checking consistency... done
    preparing documents... done
    writing output... [100%] pyspark.sql
    writing additional files... (12 module code pages) _modules/index search
    copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
    done
    copying extra files... done
    dumping search index... done
    dumping object inventory... done
    build succeeded, 4 warnings.
    
    Build finished. The HTML pages are in _build/html.
    ```
    
    Author: cocoatomo <co...@gmail.com>
    
    Closes #2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes the following commits:
    
    2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings

----




[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by debasish83 <gi...@git.apache.org>.
Github user debasish83 closed the pull request at:

    https://github.com/apache/spark/pull/2705




[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by Chanda <gi...@git.apache.org>.
Github user Chanda commented on the pull request:

    https://github.com/apache/spark/pull/2705#issuecomment-61212583
  
    @debasish83 Thanks for your suggestions. JOptimizer works for smaller datasets, but the issue is with larger datasets, where the feature size is around 30,000 and the rank is very large. Let me know when you complete the formulation for larger ranks.





[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2705#issuecomment-58572813
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2705#issuecomment-58304422
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by debasish83 <gi...@git.apache.org>.
Github user debasish83 commented on the pull request:

    https://github.com/apache/spark/pull/2705#issuecomment-58862691
  
    @Chanda what's your problem formulation?
    min x'Hx + c'x
    s.t. Ax <= B
    You can write it as min x'Hx + c'x + g(z)
    s.t. Ax + z = B
    where g(z) is the indicator function of z >= 0.
    
    Now we can solve this using QuadraticMinimizer.scala... Let me know if this formulation makes sense and I will point you to the rest of the steps...
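
    For illustration (a hedged sketch, not code from this PR), the slack reformulation amounts to stacking an identity block next to A and bounding the slack below by 0:

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Sketch of the slack reformulation: Ax <= B becomes [A I] * (x, z) = B with z >= 0.
    def addSlack(a: DenseMatrix[Double], b: DenseVector[Double]) = {
      val m = a.rows
      val n = a.cols
      val aEq = DenseMatrix.horzcat(a, DenseMatrix.eye[Double](m))   // [A  I]
      // bounds on the augmented variable (x, z): x free, z >= 0
      val lb = DenseVector.vertcat(
        DenseVector.fill[Double](n)(Double.NegativeInfinity),
        DenseVector.zeros[Double](m))
      (aEq, b, lb)
    }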
    
    By the way, I am working on supporting H as a sparse matrix, but it will take some time since we need an LDL factorization and that's in the ECOS code base... Once I make the ECOS jar available, we should be able to use LDL from there...
    
    Is your matrix sparse, i.e., do you keep a sparse kernel for the SVM rather than all entries from the RBF kernel?
    
    For now I would say use the dense formulation: partition your kernel matrix, solve a QP on each worker, and then combine the results using treeAggregate on the master (a rough sketch follows)...
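
    A rough Spark sketch of that partition-and-average scheme (illustrative only; localQpSolve is a hypothetical local dense QP / kernel-SVM solver returning a fixed-length weight vector per partition):

    import org.apache.spark.rdd.RDD

    // Solve a local QP per partition, then average the per-partition solutions
    // on the driver with treeAggregate. localQpSolve is hypothetical.
    def partitionedSolve(data: RDD[Array[Double]], rank: Int,
                         localQpSolve: Iterator[Array[Double]] => Array[Double]): Array[Double] = {
      val perPartition = data.mapPartitions(rows => Iterator((localQpSolve(rows), 1L)))
      val (sum, count) = perPartition.treeAggregate((new Array[Double](rank), 0L))(
        { case ((acc, n), (sol, m)) =>
            var i = 0; while (i < rank) { acc(i) += sol(i); i += 1 }; (acc, n + m) },
        { case ((accA, nA), (accB, nB)) =>
            var i = 0; while (i < rank) { accA(i) += accB(i); i += 1 }; (accA, nA + nB) })
      sum.map(_ / math.max(count, 1L))
    }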




[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Posted by debasish83 <gi...@git.apache.org>.
Github user debasish83 commented on the pull request:

    https://github.com/apache/spark/pull/2705#issuecomment-58450751
  
    @mengxr could you please take a first pass at it... I am focused on decreasing the iteration count of the proximal algorithm.
    
    @rezzazadeh could you please see whether the quadratic-problem ideas mentioned in your paper (http://arxiv.org/pdf/1410.0342v1.pdf) are captured in this PR. We would like to integrate the convex loss as soon as possible, since for some of our use cases we would like to experiment with hinge/Huber losses...


