Posted to dev@spark.apache.org by Ashutosh <as...@iiitb.org> on 2014/10/21 11:23:00 UTC

[MLlib] Contributing Algorithm for Outlier Detection

Hi,
I am new to Apache Spark (and to open source projects in general), and I want
to contribute to it. I found that MLlib has no algorithm for outlier detection
yet. From a literature review, the Attribute Value Frequency (AVF) algorithm
looks promising. Reference: DOI 10.1109/ICTAI.2007.125
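
As a rough illustration of the idea behind AVF (not the implementation proposed in this thread): each record is scored by the average frequency of its attribute values within their columns, and the lowest-scoring records are treated as outlier candidates. The sketch below uses plain Scala collections rather than RDDs; the names AvfSketch, avfScores and lowestScoring are hypothetical.

```scala
object AvfSketch {
  /** AVF score per row: the mean, over attributes, of how often the
    * row's value occurs in that attribute (column) across the data set. */
  def avfScores(data: Seq[Vector[String]]): Seq[Double] = {
    require(data.nonEmpty, "data must not be empty")
    val m = data.head.length
    require(m > 0, "rows must have at least one attribute")
    require(data.forall(_.length == m), "all rows must have the same length")

    // One frequency table per column: value -> number of rows holding it.
    val freq: Vector[Map[String, Int]] =
      Vector.tabulate(m) { j =>
        data.groupBy(row => row(j)).map { case (value, rows) => value -> rows.size }
      }

    data.map { row =>
      row.indices.map(j => freq(j)(row(j))).sum.toDouble / m
    }
  }

  /** Indices of the k lowest-scoring rows, i.e. the outlier candidates. */
  def lowestScoring(data: Seq[Vector[String]], k: Int): Seq[Int] =
    avfScores(data).zipWithIndex.sortBy(_._1).take(k).map(_._2)
}
```

For the four rows ("red","small"), ("red","small"), ("red","large"), ("blue","tiny"), the scores are 2.5, 2.5, 2.0 and 1.0, so the last row is the first outlier candidate.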

Following the contribution process, I understand that I have to open a new
feature request in JIRA (https://issues.apache.org/jira/browse/SPARK). I have
also checked that no existing issue covers "outlier detection".

Is this the right way to go? What do the project owners have in mind for
outlier detection? Also, is anybody working on a parallel k-nearest
neighbour implementation?

Apart from opening the feature request and then a pull request on GitHub, how
should I provide the test cases?

Suggestions and guidance are welcome.

Thanks,
Ashutosh 



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Hi Anant,


I have removed the counter and all possible side effects. I think we can now go ahead with the testing. I have created another folder for testing. I will add you as a collaborator on GitHub.


_Ashutosh

________________________________
From: slcclimber [via Apache Spark Developers List] <ml...@n3.nabble.com>
Sent: Monday, November 17, 2014 10:45 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

Ashutosh,
The counter will certainly be a parallelization issue when multiple nodes are used, especially over massive datasets.
A better approach would be something along these lines:

    val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
    val rddWithIndex = rdd.zip(index)
This zips the two RDDs in a parallelizable fashion.



Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by slcclimber <an...@gmail.com>.

Ashutosh,
The counter will certainly be a parallelization issue when multiple nodes are
used, especially over massive datasets.
A better approach would be something along these lines:

    val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
    val rddWithIndex = rdd.zip(index)
This zips the two RDDs in a parallelizable fashion.





Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by slcclimber <an...@gmail.com>.

That would be a very wise decision.
On Nov 20, 2014 3:53 PM, "Joseph Bradley [via Apache Spark Developers
List]" <ml...@n3.nabble.com> wrote:

> Could we move discussion of the design and implementation to the JIRA
> and/or a work-in-progress PR (tagged with [WIP])?  That will help leave a
> record for the future.
> Thanks!
> Joseph

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Joseph Bradley <jo...@databricks.com>.

Could we move discussion of the design and implementation to the JIRA
and/or a work-in-progress PR (tagged with [WIP])?  That will help leave a
record for the future.
Thanks!
Joseph

On Wed, Nov 19, 2014 at 9:59 PM, Ashutosh <as...@iiitb.org>
wrote:

> Done, thanks. I have added you as a collaborator so that you can add code to it.
>
>
> Thanks,
>
> Ashutosh

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Done, thanks. I have added you as a collaborator so that you can add code to it.


Thanks,

Ashutosh

________________________________
From: slcclimber [via Apache Spark Developers List] <ml...@n3.nabble.com>
Sent: Thursday, November 20, 2014 7:49 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

You could also use rdd.zipWithIndex() to create indexes.
Anant


Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by slcclimber <an...@gmail.com>.

You could also use rdd.zipWithIndex() to create indexes.
Anant
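
A minimal sketch of this suggestion, assuming a Spark version that provides RDD.zipWithIndex and a local master; IndexSketch and withIndex are hypothetical names. Unlike RDD.zip, which expects both RDDs to have the same number of partitions and the same number of elements in each partition, zipWithIndex derives the index from the RDD itself, so no separately built index RDD has to line up with the data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IndexSketch {
  // Pair every record with a Long index taken from its position in the RDD.
  // No driver-side counter is mutated inside a closure.
  def withIndex(rdd: RDD[Vector[String]]): RDD[(Long, Vector[String])] =
    rdd.zipWithIndex().map { case (row, idx) => (idx, row) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("IndexSketch")
    val sc = new SparkContext(conf)
    try {
      // Illustrative in-memory data; real input would come from a file or table.
      val rows = sc.parallelize(
        Seq(Vector("a", "x"), Vector("b", "y"), Vector("c", "z")), numSlices = 2)
      withIndex(rows).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Indices follow partition order and then the order of items within each partition, so they are stable as long as the RDD's ordering is deterministic.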




Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Please use the following snippet. I am still working on making the input a generic vector, so that
it does not always have to be Vector[String]. But String will work fine for now.


import org.apache.spark.SparkContext

def main(args: Array[String]): Unit = {
  val sc = new SparkContext("local", "OutlierDetection")
  val dir = "hdfs://localhost:54310/train3" // replace with your file path

  // Split each comma-separated line into a Vector[String].
  val data = sc.textFile(dir).map(line => line.split(",").toVector)
  val model = OutlierWithAVFModel.outliers(data, 20, sc)

  model.score.saveAsTextFile("../scores")
  model.trimmed_data.saveAsTextFile(".../trimmed")
}


________________________________
From: Meethu Mathew-2 [via Apache Spark Developers List] <ml...@n3.nabble.com>
Sent: Friday, November 14, 2014 11:42 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


Hi,

I have a question regarding the input to your algorithm.

val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext)


Here the input data is an RDD[Vector[String]]. How can we create this
RDD from a file? sc.textFile simply gives us an RDD[String]; how do we turn
each line into a Vector[String]?


Could you please share a code snippet for this conversion if you have one?


Regards,
Meethu Mathew

On Friday 14 November 2014 10:02 AM, Meethu Mathew wrote:

> Hi Ashutosh,
>
> Please edit the README file. I think the following function call has
> changed now.
>
> model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double)
>
> Regards,
>
> *Meethu Mathew*
>
> *Engineer*
>
> *Flytxt*
>
>
> On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
>> Hi Anant,
>>
>> Please see the changes.
>>
>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>
>>
>> I have changed the input format to a Vector of String. I think we can also make it generic.
>>
>>
>> Lines 59 and 72: that counter will not affect parallelism, since it only works on one data point. It only does the indexing of the column.
>>
>>
>> All other side effects have been removed.
>>
>> 
>>
>> Thanks,
>>
>> Ashutosh
>>
>>
>>
>>
>> ________________________________
>> From: slcclimber [via Apache Spark Developers List] <[hidden email]</user/SendEmail.jtp?type=node&node=9352&i=0>>
>> Sent: Tuesday, November 11, 2014 11:46 PM
>> To: Ashutosh Trivedi (MT2013030)
>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>
>>
>> Mayur,
>> Libsvm format sounds good to me. I could work on writing the tests if that helps you?
>> Anant
>>
>> On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:
>>
>> Hi Mayur,
>>
>> Vector data types are implemented using the Breeze library; they are located at
>>
>> .../org/apache/spark/mllib/linalg
>>
>>
>> Anant,
>>
>> One restriction I found is that a vector can only hold 'Double' values, which restricts the user.
>>
>> What are your thoughts on the LibSVM format?
>>
>> Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the code.
>>
>>
>> Regards,
>>
>> Ashutosh
>>
>>
>> ________________________________
>> From: Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]<http://user/SendEmail.jtp?type=node&node=9286&i=0>>
>> Sent: Saturday, November 8, 2014 12:52 PM
>> To: Ashutosh Trivedi (MT2013030)
>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>
>>> We should take a vector instead giving the user flexibility to decide
>>> data source/ type
>> What do you mean by vector datatype exactly?
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>> On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]<http://user/SendEmail.jtp?type=node&node=9239&i=0>> wrote:
>>
>>> Ashutosh,
>>> I still see a few issues.
>>> 1. On line 112 you are counting using a counter. Since this will happen in
>>> an RDD, the counter will cause issues. Also, it is not good functional style
>>> to use a filter function with a side effect.
>>> You could use randomSplit instead. This does the same thing without the
>>> side effect.
>>> 2. Similar shared usage of j in line 102 is going to be an issue as well.
>>> Also, the hash seed does not need to be sequential; it could be randomly
>>> generated or hashed on the values.
>>> 3. The compute function and trim scores still run on a comma-separated
>>> RDD. We should take a vector instead, giving the user flexibility to decide
>>> data source/type. What if we want data from Hive tables or Parquet, JSON,
>>> or Avro formats? This is a very restrictive format. With vectors the user
>>> has the choice of taking in whatever data format and converting it to
>>> vectors, instead of reading JSON files, creating a CSV file, and then working
>>> on that.
>>> 4. Similar use of counters in lines 54 and 65 is an issue.
>>> Basically, the shared-state counters are a huge issue that does not scale,
>>> since the processing of RDDs is distributed and the value j lives on the
>>> master.
>>>
>>> Anant
>>>
>>>
>>>
>>> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
>>> <[hidden email]<http://user/SendEmail.jtp?type=node&node=9239&i=1>> wrote:
>>>
>>>>    Anant,
>>>>
>>>> I got rid of those increment/decrement functions, and now the code is much
>>>> cleaner. Please check. All your comments have been addressed.
>>>>
>>>>
>>>>
>>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>>>    _Ashu
>>>>
>>>>    ------------------------------
>>>> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden
>>>> email] <http://user/SendEmail.jtp?type=node&node=9083&i=0>>
>>>> *Sent:* Friday, October 31, 2014 10:09 AM
>>>> *To:* Ashutosh Trivedi (MT2013030)
>>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>
>>>>
>>>> You should create a jira ticket to go with it as well.
>>>> Thanks
>>>> On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
>>> <[hidden
>>>> email] <http://user/SendEmail.jtp?type=node&node=9037&i=0>> wrote:
>>>>
>>>>>    Okay. I'll try it and post it soon with test case. After that I think
>>>>> we can go ahead with the PR.
>>>>>    ------------------------------
>>>>> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden
>>>>> email] <http://user/SendEmail.jtp?type=node&node=9036&i=0>>
>>>>> *Sent:* Friday, October 31, 2014 10:03 AM
>>>>> *To:* Ashutosh Trivedi (MT2013030)
>>>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>>
>>>>>
>>>>> Ashutosh,
>>>>> A vector would be a good idea; vectors are used very frequently.
>>>>> Test data is usually stored in the spark/data/mllib folder.
>>>>>    On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
>>>>> <[hidden email] <http://user/SendEmail.jtp?type=node&node=9035&i=0>>
>>>>> wrote:
>>>>>
>>>>>> Hi Anant,
>>>>>> Sorry for my late reply. Thank you for taking the time to review it.
>>>>>>
>>>>>> I have a few comments on the first issue.
>>>>>>
>>>>>> You are correct on the string (CSV) part. But we cannot take input of
>>>>>> the type you mentioned. We calculate the frequency in our function; otherwise
>>>>>> the user has to do all this computation. I realize that taking an RDD[Vector]
>>>>>> would be general enough for all. What do you say?
>>>>>>
>>>>>> I agree on all the other issues. I will correct them soon and post it.
>>>>>> I have a question on test cases. Where should I put data when providing test
>>>>>> scripts? Or should I generate synthetic data for testing within the
>>>>>> scripts? How does this work?
>>>>>>
>>>>>> Regards,
>>>>>> Ashutosh
>>>>>>

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Meethu Mathew <me...@flytxt.com>.

Hi,

I have a question regarding the input to your algorithm.

val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext)


Here the input data is an RDD[Vector[String]]. How can we create this
RDD from a file? sc.textFile simply gives us an RDD[String]; how do we turn
each line into a Vector[String]?


Could you please share a code snippet for this conversion if you have one?


Regards,
Meethu Mathew

On Friday 14 November 2014 10:02 AM, Meethu Mathew wrote:
> Hi Ashutosh,
>
> Please edit the README file. I think the following function call has
> changed now.
>
> model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double)
>
> Regards,
>
> *Meethu Mathew*
>
> *Engineer*
>
> *Flytxt*
>
>
> On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
>> Hi Anant,
>>
>> Please see the changes.
>>
>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>
>>
>> I have changed the input format to a Vector of String. I think we can also make it generic.
>>
>>
>> Lines 59 and 72: that counter will not affect parallelism, since it only works on one data point. It only does the indexing of the column.
>>
>>
>> All other side effects have been removed.
>>
>> 
>>
>> Thanks,
>>
>> Ashutosh
>>
>>
>>
>>
>> ________________________________
>> From: slcclimber [via Apache Spark Developers List] <ml...@n3.nabble.com>
>> Sent: Tuesday, November 11, 2014 11:46 PM
>> To: Ashutosh Trivedi (MT2013030)
>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>
>>
>> Mayur,
>> Libsvm format sounds good to me. I could work on writing the tests if that helps you?
>> Anant
>>
>> On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:
>>
>> Hi Mayur,
>>
>> Vector data types are implemented using the Breeze library; they are located at
>>
>> .../org/apache/spark/mllib/linalg
>>
>>
>> Anant,
>>
>> One restriction I found is that a vector can only hold 'Double' values, which restricts the user.
>>
>> What are your thoughts on the LibSVM format?
>>
>> Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the code.
>>
>>
>> Regards,
>>
>> Ashutosh
>>
>>
>> ________________________________
>> From: Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]<http://user/SendEmail.jtp?type=node&node=9286&i=0>>
>> Sent: Saturday, November 8, 2014 12:52 PM
>> To: Ashutosh Trivedi (MT2013030)
>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>
>>> We should take a vector instead giving the user flexibility to decide
>>> data source/ type
>> What do you mean by vector datatype exactly?
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>> On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]<http://user/SendEmail.jtp?type=node&node=9239&i=0>> wrote:
>>
>>> Ashutosh,
>>> I still see a few issues.
>>> 1. On line 112 you are counting using a counter. Since this will happen in
>>> a RDD the counter will cause issues. Also that is not good functional style
>>> to use a filter function with a side effect.
>>> You could use randomSplit instead. This does not the same thing without the
>>> side effect.
>>> 2. Similar shared usage of j in line 102 is going to be an issue as well.
>>> also hash seed does not need to be sequential it could be randomly
>>> generated or hashed on the values.
>>> 3. The compute function and trim scores still runs on a comma separeated
>>> RDD. We should take a vector instead giving the user flexibility to decide
>>> data source/ type. what if we want data from hive tables or parquet or JSON
>>> or avro formats. This is a very restrictive format. With vectors the user
>>> has the choice of taking in whatever data format and converting them to
>>> vectors insteda of reading json files creating a csv file and then workig
>>> on that.
>>> 4. Similar use of counters in 54 and 65 is an issue.
>>> Basically the shared state counters is a huge issue that does not scale.
>>> Since the processing of RDD's is distributed and the value j lives on the
>>> master.
>>>
>>> Anant
>>>
>>>
>>>
>>> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
>>> <[hidden email]<http://user/SendEmail.jtp?type=node&node=9239&i=1>> wrote:
>>>
>>>>    Anant,
>>>>
>>>> I got rid of those increment/ decrements functions and now code is much
>>>> cleaner. Please check. All your comments have been looked after.
>>>>
>>>>
>>>>
>>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>>>    _Ashu
>>>>
>>>> <
>>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>>>     Outlier-Detection-with-AVF-Spark/OutlierWithAVFModel.scala at master ·
>>>> codeAshu/Outlier-Detection-with-AVF-Spark · GitHub
>>>>    Contribute to Outlier-Detection-with-AVF-Spark development by creating
>>> an
>>>> account on GitHub.
>>>>    Read more...
>>>> <
>>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>>>    ------------------------------
>>>> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden
>>>> email] <http://user/SendEmail.jtp?type=node&node=9083&i=0>>
>>>> *Sent:* Friday, October 31, 2014 10:09 AM
>>>> *To:* Ashutosh Trivedi (MT2013030)
>>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>
>>>>
>>>> You should create a jira ticket to go with it as well.
>>>> Thanks
>>>> On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
>>> <[hidden
>>>> email] <http://user/SendEmail.jtp?type=node&node=9037&i=0>> wrote:
>>>>
>>>>>    Okay. I'll try it and post it soon with test case. After that I think
>>>>> we can go ahead with the PR.
>>>>>    ------------------------------
>>>>> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden
>>>>> email] <http://user/SendEmail.jtp?type=node&node=9036&i=0>>
>>>>> *Sent:* Friday, October 31, 2014 10:03 AM
>>>>> *To:* Ashutosh Trivedi (MT2013030)
>>>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>>
>>>>>
>>>>> Ashutosh,
>>>>> A vector would be a good idea vectors are used very frequently.
>>>>> Test data is usually stored in the spark/data/mllib folder
>>>>>    On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
>>>>> <[hidden email] <http://user/SendEmail.jtp?type=node&node=9035&i=0>>
>>>>> wrote:
>>>>>
>>>>>> Hi Anant,
>>>>>> sorry for my late reply. Thank you for taking time and reviewing it.
>>>>>>
>>>>>> I have few comments on first issue.
>>>>>>
>>>>>> You are correct on the string (csv) part. But we can not take input of
>>>>>> type you mentioned. We calculate frequency in our function. Otherwise
>>> user
>>>>>> has to do all this computation. I realize that taking a RDD[Vector]
>>> would
>>>>>> be general enough for all. What do you say?
>>>>>>
>>>>>> I agree on rest all the issues. I will correct them soon and post it.
>>>>>> I have a doubt on test cases. Where should I put data while giving test
>>>>>> scripts? or should i generate synthetic data for testing with in the
>>>>>> scripts, how does this work?
>>>>>>
>>>>>> Regards,
>>>>>> Ashutosh
>>>>>>

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Meethu Mathew <me...@flytxt.com>.

Hi Ashutosh,

Please edit the README file. I think the following function call has
changed now.

    model = OutlierWithAVFModel.outliers(master:String, input dir:String, percentage:Double)

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*


On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
> Hi Anant,
>
> Please see the changes.
>
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>
>
> I have changed the input format to Vector of String. I think we can also make it generic.
>
>
> Lines 59 & 72: that counter will not affect parallelism, since it only works on one data point. It only does the indexing of the columns.
>
>
> All other side effects have been removed.
>
> 
>
> Thanks,
>
> Ashutosh

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Hi Anant,

Please see the changes.

https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala


I have changed the input format to Vector of String. I think we can also make it generic.


Lines 59 & 72: that counter will not affect parallelism, since it only works on one data point. It only does the indexing of the columns.


All other side effects have been removed.
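
For reference, per-row column indexing of this kind can also be written with no counter variable at all. The sketch below is plain Scala on a single data point and is not the code from the repository; the names ColumnIndexSketch and indexColumns are made up for illustration.

```scala
object ColumnIndexSketch {
  // Pair every attribute value of one data point with its column position.
  // zipWithIndex supplies the position, so no mutable counter is needed.
  def indexColumns(row: Seq[String]): Seq[(Int, String)] =
    row.zipWithIndex.map { case (value, col) => (col, value) }

  def main(args: Array[String]): Unit = {
    val row = Seq("red", "small", "round")
    indexColumns(row).foreach { case (col, value) =>
      println(s"column $col -> $value")
    }
  }
}
```

Since the index is derived from the row itself, nothing is shared between data points, which is the property the review comments were asking for.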



Thanks,

Ashutosh




________________________________
From: slcclimber [via Apache Spark Developers List] <ml...@n3.nabble.com>
Sent: Tuesday, November 11, 2014 11:46 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


Mayur,
LibSVM format sounds good to me. I could work on writing the tests, if that helps you.
Anant

On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:

Hi Mayur,

Vector data types are implemented using the Breeze library; they are located at

.../org/apache/spark/mllib/linalg


Anant,

One restriction I found is that a vector can only hold 'Double' values, so it actually restricts the user.

What are your thoughts on the LibSVM format?

Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the code.
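
One possible way around the Double-only restriction is to encode each categorical value as a numeric code per column before building vectors. The following is only a sketch in plain Scala under that assumption; CategoricalEncodingSketch, buildDictionaries and encode are hypothetical names, and all rows are assumed to have the same number of columns.

```scala
object CategoricalEncodingSketch {
  // One dictionary per column: each distinct value gets a Double code,
  // assigned in order of first appearance within that column.
  def buildDictionaries(rows: Seq[Seq[String]]): Seq[Map[String, Double]] =
    rows.transpose.map { column =>
      column.distinct.zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap
    }

  // Replace every string value by its per-column code.
  def encode(rows: Seq[Seq[String]]): Seq[Array[Double]] = {
    val dicts = buildDictionaries(rows)
    rows.map(row => row.zip(dicts).map { case (v, dict) => dict(v) }.toArray)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Seq("a", "x"), Seq("b", "x"), Seq("a", "y"))
    encode(rows).foreach(r => println(r.mkString(",")))
  }
}
```

Each resulting Array[Double] could then presumably be wrapped with Vectors.dense from org.apache.spark.mllib.linalg. The codes are only labels here: AVF counts how often a value occurs, so their numeric order carries no meaning.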


Regards,

Ashutosh


________________________________
From: Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]>
Sent: Saturday, November 8, 2014 12:52 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

>
> We should take a vector instead giving the user flexibility to decide
> data source/ type

What do you mean by vector datatype exactly?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote:

> Ashutosh,
> I still see a few issues.
> 1. On line 112 you are counting using a counter. Since this will happen in
> an RDD, the counter will cause issues. Also, it is not good functional style
> to use a filter function with a side effect.
> You could use randomSplit instead. This does the same thing without the
> side effect.
> 2. Similar shared usage of j on line 102 is going to be an issue as well.
> Also, the hash seed does not need to be sequential; it could be randomly
> generated or hashed on the values.
> 3. The compute function and trim scores still run on a comma-separated
> RDD. We should take a vector instead, giving the user flexibility to decide
> the data source/type. What if we want data from Hive tables, or Parquet, JSON,
> or Avro formats? This is a very restrictive format. With vectors the user
> has the choice of taking in whatever data format and converting it to
> vectors, instead of reading JSON files, creating a CSV file, and then working
> on that.
> 4. Similar use of counters on lines 54 and 65 is an issue.
> Basically, the shared-state counters are a huge issue that does not scale,
> since the processing of RDDs is distributed and the value j lives on the
> master.
>
> Anant
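
To illustrate the point about shared counters, here is a rough sketch of AVF scoring written only with side-effect-free transformations. It uses plain Scala collections so it runs without a cluster, and it is not the code from the repository. The names AvfSketch, scores and outlierIndices are hypothetical, and the score is assumed to be the usual AVF definition: the mean, over a row's columns, of how often that row's value occurs in the column. Rows are assumed to be non-empty.

```scala
object AvfSketch {
  // AVF score per row: average frequency of its (column, value) pairs.
  def scores(rows: Seq[Seq[String]]): Seq[Double] = {
    // Count every (column, value) pair once, with no mutable state.
    val freq: Map[(Int, String), Int] =
      rows
        .flatMap(row => row.zipWithIndex.map { case (value, col) => (col, value) })
        .groupBy(identity)
        .map { case (key, hits) => key -> hits.size }

    rows.map { row =>
      val total = row.zipWithIndex.map { case (value, col) => freq((col, value)) }.sum
      total.toDouble / row.size
    }
  }

  // Positions of the k rows with the lowest scores (the outlier candidates).
  def outlierIndices(rows: Seq[Seq[String]], k: Int): Seq[Int] =
    scores(rows).zipWithIndex.sortWith(_._1 < _._1).take(k).map(_._2)

  def main(args: Array[String]): Unit = {
    val rows = Seq(Seq("a", "x"), Seq("a", "x"), Seq("a", "y"), Seq("b", "z"))
    println(scores(rows).mkString(", "))
    println(outlierIndices(rows, 1).mkString(", "))
  }
}
```

On an RDD the counting step would presumably become a flatMap followed by reduceByKey, and the lowest-scoring rows could be taken with takeOrdered, so that no variable living on the driver is mutated inside a closure.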
>
>
>
> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
> <[hidden email]> wrote:
>
> >  Anant,
> >
> > I got rid of those increment/decrement functions and now the code is much
> > cleaner. Please check. All your comments have been addressed.
> >
> > https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
> >
> >  _Ashu
> >
> >  ------------------------------
> > *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
> > *Sent:* Friday, October 31, 2014 10:09 AM
> > *To:* Ashutosh Trivedi (MT2013030)
> > *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
> >
> >
> > You should create a JIRA ticket to go with it as well.
> > Thanks
> > On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
> > <[hidden email]> wrote:
> >
> >>  Okay. I'll try it and post it soon with a test case. After that I think
> >> we can go ahead with the PR.
> >>  ------------------------------
> >> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
> >> *Sent:* Friday, October 31, 2014 10:03 AM
> >> *To:* Ashutosh Trivedi (MT2013030)
> >> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
> >>
> >>
> >> Ashutosh,
> >> A vector would be a good idea; vectors are used very frequently.
> >> Test data is usually stored in the spark/data/mllib folder.
> >>  On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
> >> <[hidden email]> wrote:
> >>
> >>> Hi Anant,
> >>> Sorry for my late reply. Thank you for taking the time to review it.
> >>>
> >>> I have a few comments on the first issue.
> >>>
> >>> You are correct on the string (CSV) part. But we cannot take input of
> >>> the type you mentioned. We calculate frequency in our function; otherwise
> >>> the user has to do all this computation. I realize that taking an
> >>> RDD[Vector] would be general enough for all. What do you say?
> >>>
> >>> I agree on all the other issues. I will correct them soon and post it.
> >>> I have a question on test cases. Where should I put data when providing
> >>> test scripts? Or should I generate synthetic data for testing within the
> >>> scripts? How does this work?
> >>>
> >>> Regards,
> >>> Ashutosh
> >>>
> >> ------------------------------
> >>  If you reply to this email, your message will be added to the
> >> discussion below:
> >>
> >>
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9036.html
> >>  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> >> Detection, click here.
> >> NAML
> >> <
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >>
> >
> >
> > ------------------------------
> >  If you reply to this email, your message will be added to the discussion
> > below:
> >
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9037.html
> >  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> Detection, click
> > here.
> > NAML
> > <
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >
> >
> > ------------------------------
> >  If you reply to this email, your message will be added to the discussion
> > below:
> >
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9083.html
> >  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> Detection, click
> > here
> > <
> >
> > .
> > NAML
> > <
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9095.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9239.html
To unsubscribe from [MLlib] Contributing Algorithm for Outlier Detection, click here.
NAML<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>


________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9286.html
To unsubscribe from [MLlib] Contributing Algorithm for Outlier Detection, click here.
NAML<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>


________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9287.html
To unsubscribe from [MLlib] Contributing Algorithm for Outlier Detection, click here<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=8880&code=YXNodXRvc2gudHJpdmVkaUBpaWl0Yi5vcmd8ODg4MHwtMzkzMzE5NzYx>.
NAML<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9327.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Sure, you are welcome.

Let me fix the issues you have pointed out. I'll update you by this weekend.


_Ashutosh

________________________________
From: slcclimber [via Apache Spark Developers List] <ml...@n3.nabble.com>
Sent: Tuesday, November 11, 2014 11:46 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


Mayur,
The LibSVM format sounds good to me. I could work on writing the tests, if that helps you?
Anant

On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:

Hi Mayur,

Vector data types are implemented using the Breeze library; the code is located at

.../org/apache/spark/mllib/linalg


Anant,

One restriction I found is that a vector can only hold 'Double' values, so it actually restricts the user.
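
One common way around a Double-only vector is to map each distinct categorical value to a numeric code before building the vector. The following is a minimal plain-Python sketch of that idea, not code from this thread or from MLlib; build_index and encode_rows are hypothetical helper names.

```python
def build_index(values):
    """Map each distinct value to a float code, in order of first appearance."""
    index = {}
    for v in values:
        if v not in index:
            index[v] = float(len(index))
    return index


def encode_rows(rows):
    """Encode categorical rows as lists of floats, one index per column."""
    if not rows:
        return [], []
    num_attrs = len(rows[0])
    indexes = [build_index(row[j] for row in rows) for j in range(num_attrs)]
    encoded = [[indexes[j][row[j]] for j in range(num_attrs)] for row in rows]
    return encoded, indexes


rows = [("red", "S"), ("blue", "M"), ("red", "M")]
encoded, indexes = encode_rows(rows)
print(encoded)
print(indexes)
```

The codes carry no ordering meaning; for a frequency-based method that only compares values for equality, that is usually acceptable, but it is an assumption worth stating in the documentation.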

What are your thoughts on the LibSVM format?
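
For readers unfamiliar with it, a LibSVM line has the form "label index:value index:value ...", with one-based feature indices and omitted features treated as zero. Below is a minimal plain-Python sketch of parsing one such line; it is only an illustration (MLlib ships its own loader, MLUtils.loadLibSVMFile), and parse_libsvm_line and to_dense are hypothetical names.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {zero_based_idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx) - 1] = float(val)  # LibSVM indices are one-based
    return label, features


def to_dense(features, size):
    """Expand a sparse feature dict into a dense list of the given size."""
    return [features.get(i, 0.0) for i in range(size)]


label, features = parse_libsvm_line("0 1:2.0 3:1.5")
print(label, features)
print(to_dense(features, 4))
```

Note that the format still stores every feature value as a number, so categorical attributes would need to be encoded first in either case.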

Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the code.


Regards,

Ashutosh


________________________________
From: Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]>
Sent: Saturday, November 8, 2014 12:52 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

>
> We should take a vector instead, giving the user flexibility to decide
> data source/type

What do you mean by vector datatype exactly?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote:

> Ashutosh,
> I still see a few issues.
> 1. On line 112 you are counting using a counter. Since this will happen in
> an RDD, the counter will cause issues. Also, it is not good functional style
> to use a filter function with a side effect.
> You could use randomSplit instead. This does the same thing without the
> side effect.
> 2. Similar shared usage of j in line 102 is going to be an issue as well.
> Also, the hash seed does not need to be sequential; it could be randomly
> generated or hashed on the values.
> 3. The compute function and trim scores still run on a comma-separated
> RDD. We should take a vector instead, giving the user the flexibility to
> decide the data source/type. What if we want data from Hive tables or
> Parquet, JSON, or Avro formats? This is a very restrictive format. With
> vectors the user has the choice of taking in whatever data format and
> converting it to vectors, instead of reading JSON files, creating a CSV
> file, and then working on that.
> 4. Similar use of counters on lines 54 and 65 is an issue.
> Basically, the shared-state counters are a huge issue that does not scale,
> since the processing of RDDs is distributed and the value j lives on the
> master.
>
> Anant
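
The shared-counter problem raised in points 1, 2 and 4 can be illustrated without Spark. The plain-Python sketch below is a simulation only: it assumes that each task works on its own copy of any captured variable (in Spark, closures are serialized and shipped to the executors), so updates never reach the original. The partitions, the threshold of 5, and the function names are made up for illustration.

```python
import copy
from functools import reduce

# Three "partitions" of a distributed data set (made-up values).
partitions = [[3, 8, 1], [9, 4], [7, 7, 2]]


def run_task(partition, captured_state):
    """Simulate a task that increments a counter captured from the driver."""
    local = copy.deepcopy(captured_state)  # stand-in for closure serialization
    for x in partition:
        if x > 5:
            local["count"] += 1
    return local["count"]


driver_state = {"count": 0}
task_counts = [run_task(p, driver_state) for p in partitions]
# The driver's counter was never updated; only the per-task copies changed.
print(driver_state["count"], task_counts)


def count_matching(partition):
    """Side-effect-free alternative: each partition returns its own count."""
    return sum(1 for x in partition if x > 5)


# The per-partition results are combined explicitly instead of via shared state.
total = reduce(lambda a, b: a + b, map(count_matching, partitions), 0)
print(total)
```

The second half is the shape the review asks for: every step returns a value, and the combination happens in a reduce rather than through a variable that lives on the master.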
>
>
>
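
As a rough picture of what a seeded random split does in place of a counting filter, here is a minimal plain-Python sketch. It is not how Spark implements randomSplit; random_split is a hypothetical helper, and it runs sequentially over a list. In a distributed setting each partition would need its own seeded generator.

```python
import random


def random_split(items, weights, seed):
    """Assign each item to one split with probability proportional to weights."""
    total = float(sum(weights))
    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)  # local generator, no global or shared state
    splits = [[] for _ in weights]
    for item in items:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r < b:
                splits[i].append(item)
                break
        else:
            splits[-1].append(item)  # guard against floating-point rounding
    return splits


train, test = random_split(list(range(100)), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

The split sizes are only approximately proportional to the weights, but the same seed always reproduces the same split, which is what makes it usable in tests.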
> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
> <[hidden email]> wrote:
>
> >  Anant,
> >
> > I got rid of those increment/decrement functions and now the code is much
> > cleaner. Please check. All your comments have been addressed.
> >
> >
> >
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
> >
> >
> >  _Ashu
> >
> >

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by slcclimber <an...@gmail.com>.

Mayur,
The LibSVM format sounds good to me. I could work on writing the tests, if
that helps you?
Anant

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Hi Mayur,

Vector data types are implemented using the Breeze library; the code is located at

.../org/apache/spark/mllib/linalg


Anant,

One restriction I found is that a vector can only hold 'Double' values, which restricts the user.
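
To illustrate the restriction: categorical values have to be mapped to Double codes before they fit into a vector. A minimal sketch in plain Scala, assuming Array[Double] as a stand-in for an MLlib dense vector; the object and method names are made up for illustration and are not part of MLlib or of my code:

```scala
object CategoricalEncoding {
  // Assign each distinct category a Double code, in order of first appearance.
  def buildCodes(values: Seq[String]): Map[String, Double] =
    values.distinct.zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap

  // Encode rows of categorical strings column by column.
  // Array[Double] is only a stand-in for a dense vector here.
  def encode(rows: Seq[Seq[String]]): Seq[Array[Double]] = {
    require(rows.nonEmpty, "need at least one row")
    val numCols = rows.head.length
    val codes = (0 until numCols).map(c => buildCodes(rows.map(_(c))))
    rows.map(r => r.zipWithIndex.map { case (v, c) => codes(c)(v) }.toArray)
  }
}
```

The codes carry no numeric meaning; they only make the values representable as Doubles, so the user has to keep the mapping around to interpret the results.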

What are your thoughts on the LibSVM format?
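
For reference, a LibSVM line has the form "label index:value index:value ..." with one-based indices and zero entries left out. A minimal parsing sketch in plain Scala, for illustration only; the object name and the numFeatures parameter are my own assumptions, and MLlib ships its own loader in MLUtils:

```scala
object LibSvmLine {
  // Parse one LibSVM-formatted line into (label, dense feature array).
  // numFeatures is supplied by the caller; indices in the line are one-based.
  def parse(line: String, numFeatures: Int): (Double, Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = Array.fill(numFeatures)(0.0)
    tokens.tail.foreach { t =>
      val parts = t.split(":")
      features(parts(0).toInt - 1) = parts(1).toDouble
    }
    (label, features)
  }
}
```

Note that the format expects a label on every line, which outlier detection input would not normally have.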

Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted, and I'll try to fix them soon. Tests are also required for the code.


Regards,

Ashutosh


________________________________
From: Mayur Rustagi [via Apache Spark Developers List] <ml...@n3.nabble.com>
Sent: Saturday, November 8, 2014 12:52 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

>
> We should take a vector instead giving the user flexibility to decide
> data source/ type

What do you mean by vector datatype exactly?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>







Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Mayur Rustagi <ma...@gmail.com>.

>
> We should take a vector instead giving the user flexibility to decide
> data source/ type

What do you mean by vector datatype exactly?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <an...@gmail.com> wrote:

> Ashutosh,
> I still see a few issues.
> 1. On line 112 you are counting with a counter. Since this happens inside an
> RDD operation, the counter will cause issues. It is also not good functional
> style to use a filter function with a side effect.
> You could use randomSplit instead. It does the same thing without the side
> effect.
> 2. The similar shared use of j on line 102 is going to be an issue as well.
> Also, the hash seed does not need to be sequential; it could be randomly
> generated or hashed on the values.
> 3. The compute function and trim scores still run on a comma-separated RDD.
> We should take a vector instead, giving the user the flexibility to decide the
> data source and type. What if we want data from Hive tables, or from Parquet,
> JSON, or Avro formats? CSV is a very restrictive format. With vectors, the user
> can take in whatever data format they have and convert it to vectors, instead
> of reading JSON files, creating a CSV file, and then working on that.
> 4. The similar use of counters on lines 54 and 65 is an issue.
> Basically, the shared-state counters are a huge issue that does not scale,
> since the processing of RDDs is distributed and the value j lives on the
> master.
>
> Anant

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by slcclimber <an...@gmail.com>.

Ashutosh,
I still see a few issues.
1. On line 112 you are counting with a counter. Since this happens inside an
RDD operation, the counter will cause issues. It is also not good functional
style to use a filter function with a side effect.
You could use randomSplit instead. It does the same thing without the side
effect.
2. The similar shared use of j on line 102 is going to be an issue as well.
Also, the hash seed does not need to be sequential; it could be randomly
generated or hashed on the values.
3. The compute function and trim scores still run on a comma-separated RDD.
We should take a vector instead, giving the user the flexibility to decide the
data source and type. What if we want data from Hive tables, or from Parquet,
JSON, or Avro formats? CSV is a very restrictive format. With vectors, the user
can take in whatever data format they have and convert it to vectors, instead
of reading JSON files, creating a CSV file, and then working on that.
4. The similar use of counters on lines 54 and 65 is an issue.
Basically, the shared-state counters are a huge issue that does not scale,
since the processing of RDDs is distributed and the value j lives on the
master.
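
As a rough illustration of the counter-free style (a sketch only: plain Scala
collections stand in for the RDD, and the rows and names are made up rather
than taken from your file):

```scala
object NoSharedCounter {
  // Hypothetical stand-in for the rows of the input data set.
  val rows: Seq[String] = Seq("a,x", "b,y", "a,y", "c,x")

  // Instead of a `var j` mutated inside map or filter, pair each row with its index.
  val indexed: Seq[(String, Int)] = rows.zipWithIndex

  // Select rows by index; the closure captures no mutable state.
  val evenRows: Seq[String] = indexed.collect { case (row, i) if i % 2 == 0 => row }

  // A seed hashed from the row's own values rather than taken from a sequential counter.
  def seedFor(row: String): Int = scala.util.hashing.MurmurHash3.stringHash(row)
}
```

RDDs also provide zipWithIndex, and randomSplit takes an array of weights plus
a seed, so neither needs a counter shared with the driver.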

Anant



On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
<ml...@n3.nabble.com> wrote:

>  Anant,
>
> I got rid of those increment/decrement functions, and now the code is much
> cleaner. Please check. All your comments have been addressed.
>
>
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>
>
>  _Ashu





Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Anant,

I got rid of those increment/decrement functions, and now the code is much cleaner. Please check. All your comments have been addressed.

https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala


_Ashu



________________________________
From: slcclimber [via Apache Spark Developers List] <ml...@n3.nabble.com>
Sent: Friday, October 31, 2014 10:09 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


You should create a jira ticket to go with it as well.
Thanks

On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:

Okay. I'll try it and post it soon with a test case. After that I think we can go ahead with the PR.

________________________________
From: slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
Sent: Friday, October 31, 2014 10:03 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


Ashutosh,
A vector would be a good idea; vectors are used very frequently.
Test data is usually stored in the spark/data/mllib folder

On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:
Hi Anant,
Sorry for my late reply. Thank you for taking the time to review it.

I have a few comments on the first issue.

You are correct about the string (CSV) part. But we cannot take input of the type you mentioned: we calculate the frequencies in our function, and otherwise the user would have to do all of that computation. I realize that taking an RDD[Vector] would be general enough for everyone. What do you say?
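
Roughly, with vector input the function would still compute the frequencies itself. A minimal local sketch of AVF scoring as I understand it from the paper, with Seq[Array[Double]] standing in for RDD[Vector]; the names are made up for illustration and this is not the code in my repository:

```scala
object AvfSketch {
  // AVF score of a row: the mean, over its attributes, of how often each
  // attribute value occurs in that column across the whole data set.
  // Lower scores suggest outliers.
  def scores(rows: Seq[Array[Double]]): Seq[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val numCols = rows.head.length
    // One frequency table per column: value -> number of rows with that value.
    val freq: IndexedSeq[Map[Double, Int]] =
      (0 until numCols).map { c =>
        rows.groupBy(_(c)).map { case (v, rs) => v -> rs.size }
      }
    rows.map { r =>
      (0 until numCols).map(c => freq(c)(r(c))).sum.toDouble / numCols
    }
  }
}
```

On an RDD, the per-column counts would presumably come from a keyed aggregation such as reduceByKey instead of a local groupBy.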

I agree on all the remaining issues. I will correct them soon and post the update.
I have a question about test cases. Where should I put the data when providing test scripts? Or should I generate synthetic data for testing within the scripts? How does this work?
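
For example, synthetic data could be generated inside the test itself, roughly like this (a sketch only; the object name, value range, and planted outlier value are made up for illustration):

```scala
import scala.util.Random

object SyntheticData {
  // Generate n rows of numCols categorical codes drawn from {0.0, 1.0, 2.0},
  // then append one row whose code (99.0) appears nowhere else, so a test
  // can check that this row is flagged as the outlier.
  def generate(n: Int, numCols: Int, seed: Long): Seq[Array[Double]] = {
    val rng = new Random(seed)
    val normal = Seq.fill(n)(Array.fill(numCols)(rng.nextInt(3).toDouble))
    normal :+ Array.fill(numCols)(99.0)
  }
}
```

A fixed seed keeps the generated data, and therefore the test, reproducible.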

Regards,
Ashutosh




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9083.html

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Already done. Here is the link:

 https://issues.apache.org/jira/browse/SPARK-4038

________________________________
From: slcclimber [via Apache Spark Developers List] <ml...@n3.nabble.com>
Sent: Friday, October 31, 2014 10:09 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

You should create a jira ticket to go with it as well.
Thanks

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9038.html

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Okay. I'll try it and post it soon with a test case. After that, I think we can go ahead with the PR.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9036.html

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by slcclimber <an...@gmail.com>.

Ashutosh,
A vector would be a good idea; vectors are used very frequently.
Test data is usually stored in the spark/data/mllib folder.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9035.html

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Hi Anant,
Sorry for my late reply. Thank you for taking the time to review it.

I have a few comments on the first issue.

You are correct about the string (CSV) part. But we cannot take input of the
type you mentioned, because we calculate the frequencies inside our function;
otherwise the user would have to do all of that computation. I realize that
taking an RDD[Vector] would be general enough for everyone. What do you say?

I agree on all the remaining issues. I will correct them soon and post the
result.
I have a question about test cases. Where should I put the data when
providing test scripts? Or should I generate synthetic data for testing
within the scripts? How does this work?

Regards,
Ashutosh



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9034.html

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by slcclimber <an...@gmail.com>.

Ashu,
There is one main issue and a few stylistic/grammatical things I noticed.
1> You take an RDD of type String, which you expect to be comma separated.
This limits usability, since the user will have to convert their RDD to that
format only for you to split the string again.
It would make more sense to take an RDD of type ((col_num: Int,
attr_value: Int), frequency: Int).
You could also use Long instead of Int.

2> The increment functions could be more along the lines of
    def incr = {count += 1; count}
which is in a more functional style.

3> The reset functions could be simply
    def reset_count = count = 1L

4> In
https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala#L108
you have a key of type String which is basically a string of the form
"number, string",
when you could just have a tuple of the form (i: Int, word: String).

5> The lines exceed the style guide's 100-character limit.
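
A minimal sketch of points 1 and 4, with plain Scala collections standing in
for the RDD so that it runs without Spark. The names TupleKeySketch and
valueFrequencies are hypothetical and are not taken from the linked
repository:

```scala
object TupleKeySketch {
  // Hypothetical helper: count how often each value occurs in each column.
  // The key is a (columnIndex, value) tuple rather than a "number,string"
  // String, so nothing has to be split or parsed again later.
  def valueFrequencies(rows: Seq[Seq[String]]): Map[(Int, String), Long] =
    rows
      .flatMap(row => row.zipWithIndex.map { case (value, col) => (col, value) })
      .groupBy(identity)
      .map { case (key, occurrences) => key -> occurrences.size.toLong }
}
```

On an actual RDD the same shape could probably be reached with flatMap
followed by reduceByKey; that adaptation is an assumption, not something
taken from the code under review.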

Thanks
Anant



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p8992.html

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Hi Anant,

Thank you for reviewing and helping us out. The initial code is at the
following link:
https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala


The input file for the code should be in CSV format. We have provided a
dataset at the same link.

We are currently facing the following style issues in the code (the code
works fine, though):

At lines 62 and 79 we have redundant functions and variables
(count_dataPoint, count_trimmedData) for assigning line numbers within the
function trimScores().

At lines 144 and 149, if we do not use two separate functions to increment
line numbers, we get erroneous results. Is there an alternative way of
handling that?

We think this is because of Scala closures: a local variable that is not
part of the RDD does not get updated in subsequent PairRDDFunctions
operations.
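
One possible alternative to the mutable counters, sketched on a plain Scala
collection so that it runs without Spark: attach the line number to each
element with zipWithIndex instead of incrementing a variable captured by a
closure. RDD also provides zipWithIndex, but whether it fits trimScores() is
an assumption; numberLines is a hypothetical name:

```scala
object LineNumberSketch {
  // Hypothetical helper: pair every record with its 1-based line number.
  // No variable outside the collection is mutated, so there is nothing for
  // a closure to copy and then fail to update.
  def numberLines[A](records: Seq[A]): Seq[(Long, A)] =
    records.zipWithIndex.map { case (record, idx) => (idx + 1L, record) }
}
```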


Regards,
Ashutosh & Kaushik 



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p8990.html

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Hi,
We are ready with the initial code. Where can I submit it for review? I
want to get it reviewed before testing it at scale.
Also, I see that most of the algorithms take data as RDD[LabeledPoint]. How
should we take input for this algorithm, since there are no labels?

Can anybody help me out with these issues? Here is the JIRA opened for it:
https://issues.apache.org/jira/browse/SPARK-4038

Regards,
Ashutosh 



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p8935.html

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Ashutosh <as...@iiitb.org>.

Hi Xiangrui,

Thanks for the reply. AVF is not difficult to implement in parallel. It
just calculates the frequency of each attribute value and combines those
frequencies into an overall 'score' for each data point. Low-scoring points
are considered outliers. One advantage is that it does not calculate
distances, so in that sense it is general.
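
A minimal sketch of the scoring described above, assuming the score of a row
is the mean frequency of its attribute values and that every row is
non-empty and has the same number of attributes. It uses plain Scala
collections rather than RDDs so that it runs without Spark; AvfSketch,
scores and lowestScoring are hypothetical names, not the code in the
repository:

```scala
object AvfSketch {
  // AVF score of each row: the mean, over its attributes, of how many rows
  // share the same value in that attribute.
  def scores(rows: Seq[Seq[String]]): Seq[Double] = {
    // Frequency of each (columnIndex, value) pair across the whole data set.
    val freq: Map[(Int, String), Int] =
      rows
        .flatMap(_.zipWithIndex.map { case (value, col) => (col, value) })
        .groupBy(identity)
        .map { case (key, hits) => key -> hits.size }

    rows.map { row =>
      val total = row.zipWithIndex.map { case (value, col) => freq((col, value)) }.sum
      total.toDouble / row.size
    }
  }

  // The k rows with the lowest scores are reported as outlier candidates.
  def lowestScoring(rows: Seq[Seq[String]], k: Int): Seq[Seq[String]] =
    rows.zip(scores(rows)).sortBy(_._2).take(k).map(_._1)
}
```

Both steps are a count followed by a per-row lookup, which is why the
approach needs no pairwise distance computation.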

I have to look at the one you pointed out. It calculates the hat matrix, and
I am not sure about calculating the hat matrix in parallel, but Mahalanobis
distance can be implemented. http://en.wikipedia.org/wiki/Mahalanobis_distance

I have opened the JIRA:
 https://issues.apache.org/jira/browse/SPARK-4038
Let's discuss it over there.

Regards,
Ashutosh



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p8894.html

Re: [MLlib] Contributing Algorithm for Outlier Detection

Posted by Xiangrui Meng <me...@gmail.com>.

Hi Ashutosh,

The process you described is correct, with details documented in
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
. There is no outlier detection algorithm in MLlib. Before you start
coding, please open a JIRA and let's discuss which algorithms are
appropriate to include, because there are many outlier detection
algorithms. I'm not sure which one is general enough and easy to
implement in parallel. For example, I'm not familiar with the
algorithm you mentioned, while the one I'm familiar with is based on
leverage scores: http://en.wikipedia.org/wiki/Leverage_(statistics)

Best,
Xiangrui

