Posted to reviews@spark.apache.org by husseinhazimeh <gi...@git.apache.org> on 2016/07/08 01:08:54 UTC

[GitHub] spark pull request #14101: [SPARK-16431] [ML] Add a unified method that acce...

GitHub user husseinhazimeh opened a pull request:

    https://github.com/apache/spark/pull/14101

    [SPARK-16431] [ML] Add a unified method that accepts single instances to feature transformers and predictors

    ## What changes were proposed in this pull request?
    Current feature transformers in spark.ml can only operate on DataFrames and don't have a method that accepts single instances. A typical transformer has a User-Defined Function (udf) in its `transform` method which includes a set of operations on the features of a single instance:
    
    ```
    val column_operation = udf {operations on single instance}
    ```
    
    It would be useful to add a new method, `transformInstance`, that operates directly on a single instance, and to have the udf call it instead:
    
    ```
    def transformInstance(features: featuresType): OutputType = {operations on single instance}
    
    val column_operation = udf {transformInstance}
    ```
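    
    The pattern above can be sketched without Spark as follows. This is a minimal, hypothetical illustration: `InstanceTransformer` and `ToyTokenizer` are made-up names, not classes from this patch, and a plain `Seq` stands in for a DataFrame column so the example stays self-contained.
    
    ```scala
    // Hypothetical stand-in for a spark.ml feature transformer; not Spark's API.
    // A Seq plays the role of a DataFrame column so the sketch runs without Spark.
    trait InstanceTransformer[In, Out] {
      // Per-instance logic lives in one place and can be called directly.
      def transformInstance(features: In): Out

      // The batch path reuses the same function, as the udf would inside `transform`.
      def transform(column: Seq[In]): Seq[Out] = column.map(transformInstance)
    }

    // Toy tokenizer: lower-case the text and split it on whitespace.
    object ToyTokenizer extends InstanceTransformer[String, Seq[String]] {
      def transformInstance(features: String): Seq[String] =
        features.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
    }
    ```
    
    Because `transform` only maps `transformInstance` over its input, both paths give the same result by construction, and the per-instance function can be unit-tested on its own.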
    
    Predictors also lack a public method that makes predictions on single instances. `transformInstance` can easily be added to predictors as a wrapper around the internal `predict` method (which takes features as input).
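    
    A rough sketch of the wrapper idea for predictors, under stated assumptions: `ToyPredictionModel` and `ToyLinearClassifier` are hypothetical classes written for illustration, with features simplified to `Array[Double]`; they are not the spark.ml classes touched by this patch.
    
    ```scala
    // Hypothetical model base class; the real spark.ml hierarchy is more involved.
    abstract class ToyPredictionModel {
      // Internal prediction logic, not callable from outside the class hierarchy.
      protected def predict(features: Array[Double]): Double

      // Public single-instance entry point that simply delegates to `predict`.
      def transformInstance(features: Array[Double]): Double = predict(features)
    }

    // Toy linear classifier: predicts 1.0 when the margin is positive, else 0.0.
    class ToyLinearClassifier(weights: Array[Double], intercept: Double)
        extends ToyPredictionModel {
      protected def predict(features: Array[Double]): Double = {
        require(features.length == weights.length, "feature/weight length mismatch")
        val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
        if (margin > 0.0) 1.0 else 0.0
      }
    }
    ```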
    
    Note: The proposed method in this change is added to all predictors and feature transformers except OneHotEncoder, VectorSlicer, and Word2Vec, which might require bigger changes due to dependencies on the dataset's schema (they can be fixed using simple hacks, but this needs to be discussed).
    
    ## Benefits
    
    1. Providing a low-latency transformation/prediction method to support machine learning applications that require real-time predictions. The current `transform` method has a relatively high latency when transforming single instances or small batches due to the overhead introduced by DataFrame operations. I measured the latency required to classify a single instance in the 20 Newsgroups dataset using the current `transform` method and the proposed `transformInstance`. The ML pipeline contains a tokenizer, stopword remover, TF hasher, IDF, scaler, and logistic regression. The table below shows the latency percentiles in milliseconds after measuring the time to classify 700 documents.
    
     Transformation method | P50 (ms) | P90 (ms) | P99 (ms) | Max (ms)
     --------------------- | -------- | -------- | -------- | --------
     transform | 31.44 | 39.43 | 67.75 | 126.97
     transformInstance | 0.16 | 0.38 | 1.16 | 3.2
    
     `transformInstance` is about 200 times faster (0.16 ms versus 31.44 ms at P50) and typically classifies a document in less than a millisecond. Profiling `transform` shows that every transformer in the pipeline spends 5 milliseconds on average in DataFrame-related operations when transforming a single instance, which is consistent with the six-stage pipeline's P50 of 31.44 ms. This implies that latency increases linearly with the pipeline size, which can be problematic.
     
    2. Increasing code readability and allowing easier debugging as operations on rows are now combined into a function that can be tested independently of the higher-level `transform` method.
    
    3. Adding flexibility to create new models: for example, check this [comment](https://github.com/apache/spark/pull/8883#issuecomment-215559305) on supporting new ensemble methods.
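    
    The latency percentiles reported under benefit 1 could be gathered with a small harness like the one below. It is a sketch under assumptions: `LatencySketch`, `timeMillis`, and `percentile` are hypothetical helpers, and the nearest-rank percentile definition is assumed because the exact measurement method is not stated here.
    
    ```scala
    object LatencySketch {
      // Time a single call in milliseconds using the JVM's monotonic clock.
      def timeMillis[A](body: => A): Double = {
        val start = System.nanoTime()
        body
        (System.nanoTime() - start) / 1e6
      }

      // Nearest-rank percentile (assumed convention): the smallest sample such
      // that at least p percent of all samples are less than or equal to it.
      def percentile(samples: Seq[Double], p: Double): Double = {
        require(samples.nonEmpty, "need at least one sample")
        require(p > 0.0 && p <= 100.0, "p must be in (0, 100]")
        val sorted = samples.sorted
        val rank = math.ceil(p / 100.0 * sorted.length).toInt
        sorted(rank - 1)
      }
    }
    ```
    
    One would collect a `timeMillis { ... }` sample around each single-document classification and then report `percentile(samples, 50)`, `percentile(samples, 99)`, and the maximum.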
    
    ## How was this patch tested?
    The current tests for transformers and predictors, which invoke `transformInstance` internally, passed. 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/husseinhazimeh/spark lowlatency

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14101.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14101
    
----
commit e8b3de1e599225fa71fecc17aaa34998863fb38b
Author: Hussein Hazimeh <ha...@mit.edu>
Date:   2016-07-07T20:50:22Z

    Add transformInstance method to predictors and transformers

commit ca213e338bde7da2e308b2ffd9c3fa1b5d26122e
Author: Hussein Hazimeh <hh...@ieee.org>
Date:   2016-07-07T21:03:46Z

    Update LogisticRegression.scala

commit 1fe5b18a0519d324ed53108ddd809a421a811f50
Author: Hussein Hazimeh <hh...@ieee.org>
Date:   2016-07-07T21:21:45Z

    Update HashingTF.scala

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14101: [SPARK-16431] [ML] Add a unified method that acce...

Posted by asfgit <gi...@git.apache.org>.
GitHub user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14101




[GitHub] spark issue #14101: [SPARK-16431] [ML] Add a unified method that accepts sin...

Posted by husseinhazimeh <gi...@git.apache.org>.
GitHub user husseinhazimeh commented on the issue:

    https://github.com/apache/spark/pull/14101
  
    @rxin your feedback would be appreciated




[GitHub] spark issue #14101: [SPARK-16431] [ML] Add a unified method that accepts sin...

Posted by rxin <gi...@git.apache.org>.
GitHub user rxin commented on the issue:

    https://github.com/apache/spark/pull/14101
  
    I don't know ML that well.
    
    cc @jkbradley @thunterdb @dbtsai @yanboliang 




[GitHub] spark issue #14101: [SPARK-16431] [ML] Add a unified method that accepts sin...

Posted by jkbradley <gi...@git.apache.org>.
GitHub user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14101
  
    I just responded on the main JIRA.  Can you please check that out and close this issue for now?  Thanks!




[GitHub] spark issue #14101: [SPARK-16431] [ML] Add a unified method that accepts sin...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14101
  
    Can one of the admins verify this patch?




[GitHub] spark issue #14101: [SPARK-16431] [ML] Add a unified method that accepts sin...

Posted by husseinhazimeh <gi...@git.apache.org>.
GitHub user husseinhazimeh commented on the issue:

    https://github.com/apache/spark/pull/14101
  
    @mengxr @sethah can you review this patch?

