Posted to issues@spark.apache.org by "Ashutosh Trivedi (JIRA)" <ji...@apache.org> on 2014/11/12 11:53:34 UTC

[jira] [Comment Edited] (SPARK-4038) Outlier Detection Algorithm for MLlib

    [ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207925#comment-14207925 ] 

Ashutosh Trivedi edited comment on SPARK-4038 at 11/12/14 10:53 AM:
--------------------------------------------------------------------

The questions raised are valid, and we would like the community to discuss them.

This algorithm deals with categorical data. It takes the simplest approach: it calculates the frequency of each attribute value in the data set. Some people in the community are already reviewing it, and I am working on it.
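
For readers unfamiliar with the approach, here is a minimal sketch of frequency-based scoring in plain Python rather than Spark. It is not the code under review. The function names and the toy data are made up for illustration, and the scoring rule (mean frequency of a row's attribute values, with the lowest scores treated as outlier candidates) is an assumption based on the usual statement of AVF. The sketch makes one counting pass and then one scoring pass over the rows.

```python
from collections import Counter


def avf_scores(rows):
    """Return one score per row: the mean frequency of its attribute values."""
    if not rows:
        return []
    num_attrs = len(rows[0])
    # Counting pass: how often each value occurs in each attribute (column).
    counts = [Counter() for _ in range(num_attrs)]
    for row in rows:
        for j, value in enumerate(row):
            counts[j][value] += 1
    # Scoring pass: average the frequencies of the values found in each row.
    return [
        sum(counts[j][value] for j, value in enumerate(row)) / num_attrs
        for row in rows
    ]


def lowest_scoring(rows, k):
    """Return the indices of the k rows with the lowest scores."""
    scores = avf_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])[:k]


# Hypothetical toy data: (colour, size) pairs.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "tiny"),
]

print(avf_scores(rows))         # [2.5, 2.5, 2.0, 1.0]
print(lowest_scoring(rows, 1))  # [3]
```

The last row scores lowest because both of its values occur only once. Since the per-attribute counts are plain sums, counts computed on separate partitions can be added together, which is what makes this kind of scoring easy to parallelize.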

I did not find any other algorithm that works on categorical data to find outliers. If you are aware of another well-known algorithm, please share it with us.

  


was (Author: rusty):
The questions raised are valid, and we would like the community to discuss them.

This algorithm deals with categorical data. To my knowledge, it takes the simplest approach: it calculates the frequency of each attribute value in the data set. Some people in the community are already reviewing it, and I am working on it.

I did not find any other algorithm that works on categorical data to find outliers. If you are aware of another well-known algorithm, please share it with us.

  

> Outlier Detection Algorithm for MLlib
> -------------------------------------
>
>                 Key: SPARK-4038
>                 URL: https://issues.apache.org/jira/browse/SPARK-4038
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ashutosh Trivedi
>            Priority: Minor
>
> The aim of this JIRA is to discuss which parallel outlier detection algorithms can be included in MLlib.
> The one I am familiar with is Attribute Value Frequency (AVF). It scales linearly with the number of data points and attributes, and relies on a single data scan. It is not distance-based and is well suited to categorical data. The original paper also gives a parallel version, which is not complicated to implement. I am working on the implementation and will soon submit the initial code for review.
> Here is the link to the paper:
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in the discussion at
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> there are other algorithms as well. Let's discuss which would be more general and easier to parallelize.
>    



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
