Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/06/23 07:01:00 UTC

[jira] [Comment Edited] (SPARK-4038) Outlier Detection Algorithm for MLlib

    [ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597166#comment-14597166 ] 

Joseph K. Bradley edited comment on SPARK-4038 at 6/23/15 5:00 AM:
-------------------------------------------------------------------

K-Means seemed like the easiest choice for implementation + general usefulness.

For AVF and LOF, it'd be good to get feedback about use cases since I'm not that familiar with those.  (Are they among the most commonly used methods?  In what applications?)
* I noticed someone wrote AVF for Spark, though I have not looked at the code yet: [https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark]

KNN sounds expensive in a distributed setting.  That should probably come later.

For my records, linking some papers here:
* [AVF | http://enriquegortiz.com/wordpress/enriquegortiz/research/undergraduate/outlier-detection/]
* [LOF: Identifying Density-Based Local Outliers | http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf]
* [about distributed outlier detection | http://etd.fcla.edu/CF/CFE0002734/Koufakou_Anna_200908_PhD.pdf]

(If others have references, please link them too!)



> Outlier Detection Algorithm for MLlib
> -------------------------------------
>
>                 Key: SPARK-4038
>                 URL: https://issues.apache.org/jira/browse/SPARK-4038
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ashutosh Trivedi
>            Priority: Minor
>
> The aim of this JIRA is to discuss which parallel outlier detection algorithms can be included in MLlib.
> The one I am familiar with is Attribute Value Frequency (AVF). It scales linearly with the number of data points and attributes, and relies on a single data scan. It is not distance-based and is well suited to categorical data. The original paper also gives a parallel version, which is not complicated to implement. I am working on the implementation and will soon submit the initial code for review.
> Here is the link to the paper:
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in this discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> there are other algorithms as well. Let's discuss which are more general and more easily parallelized.
>    
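
For readers unfamiliar with AVF, the idea in the quoted description can be sketched roughly as follows. This is a minimal single-machine illustration in plain Python, not the reporter's implementation and not an MLlib API. It assumes the usual formulation in which a point's score is the average frequency of its attribute values, so that low scores suggest outliers. The function names `avf_scores` and `lowest_scoring` are hypothetical, and the sketch uses one pass to count and one pass to score.

```python
from collections import Counter


def avf_scores(rows):
    """Return one AVF-style score per row; lower scores suggest outliers.

    Assumes every row is a sequence of categorical values of equal length.
    """
    if not rows:
        return []
    m = len(rows[0])
    # Pass 1: count how often each value occurs in each attribute (column).
    counts = [Counter() for _ in range(m)]
    for row in rows:
        for j, value in enumerate(row):
            counts[j][value] += 1
    # Pass 2: score each row by the mean frequency of its attribute values.
    return [sum(counts[j][v] for j, v in enumerate(row)) / m for row in rows]


def lowest_scoring(rows, k):
    """Return the indices of the k rows with the lowest scores."""
    scores = avf_scores(rows)
    order = sorted(range(len(rows)), key=lambda i: scores[i])
    return order[:k]


if __name__ == "__main__":
    data = [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y")]
    print(avf_scores(data))
    print(lowest_scoring(data, 1))
```

The per-attribute counts are plain sums, so in a distributed setting they could presumably be computed per partition and merged before scoring, which is likely why the description calls the method easy to parallelize.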


