Posted to user@spark.apache.org by anup ahire <ah...@gmail.com> on 2016/11/22 16:00:19 UTC

find outliers within data

I have a large data set with millions of records that looks something like this:

Movie Likes Comments Shares Views
 A     100     10      20     30
 A     102     11      22     35
 A     104     12      25     45
 A     *103*   13     *24*    50
 B     200     10      20     30
 B     205    *9*      21     35
 B     *203*   12      29     42
 B     210     13     *23*   *39*

Likes, comments, etc. are rolling totals and are supposed to increase. If
any of them drops for a movie, that is bad data and needs to be
identified.

My initial thought is to group by movie and then sort within each group.
I am using DataFrames in Spark 1.6 for processing, and this does not seem
to be achievable because there is no sorting within grouped data in a
DataFrame.

Building something for outlier detection could be another approach, but
because of time constraints I have not explored it yet.

Is there any way I can achieve this?

Thanks !!

Re: find outliers within data

Posted by Yong Zhang <ja...@hotmail.com>.

Have you looked at Spark DataFrame window functions?


https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
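
For illustration, here is a minimal sketch of what a lag-style window comparison computes, written in plain Python so it runs without a cluster. It assumes there is a column that orders the records within each movie; the question does not show one, so the name "seq" and its values are hypothetical, as is the helper name find_drops. In Spark the equivalent would be lag() over a window partitioned by Movie and ordered by that column, as the linked post describes. Note that in Spark 1.x window functions need a HiveContext rather than a plain SQLContext.

```python
from itertools import groupby
from operator import itemgetter

# Counter columns that are expected to never decrease within a movie.
COUNTERS = ("Likes", "Comments", "Shares", "Views")

# Sample rows from the question. "seq" is a hypothetical ordering column
# (e.g. a load timestamp); the original data does not show one.
rows = [
    {"Movie": "A", "seq": 1, "Likes": 100, "Comments": 10, "Shares": 20, "Views": 30},
    {"Movie": "A", "seq": 2, "Likes": 102, "Comments": 11, "Shares": 22, "Views": 35},
    {"Movie": "A", "seq": 3, "Likes": 104, "Comments": 12, "Shares": 25, "Views": 45},
    {"Movie": "A", "seq": 4, "Likes": 103, "Comments": 13, "Shares": 24, "Views": 50},
    {"Movie": "B", "seq": 1, "Likes": 200, "Comments": 10, "Shares": 20, "Views": 30},
    {"Movie": "B", "seq": 2, "Likes": 205, "Comments": 9, "Shares": 21, "Views": 35},
    {"Movie": "B", "seq": 3, "Likes": 203, "Comments": 12, "Shares": 29, "Views": 42},
    {"Movie": "B", "seq": 4, "Likes": 210, "Comments": 13, "Shares": 23, "Views": 39},
]


def find_drops(rows):
    """Return (movie, seq, column, previous, current) for every counter
    that is lower than in the previous row of the same movie."""
    drops = []
    # Equivalent of partitioning by Movie and ordering by seq.
    ordered = sorted(rows, key=itemgetter("Movie", "seq"))
    for movie, group in groupby(ordered, key=itemgetter("Movie")):
        prev = None  # equivalent of lag(): the first row has no predecessor
        for row in group:
            if prev is not None:
                for col in COUNTERS:
                    if row[col] < prev[col]:
                        drops.append((movie, row["seq"], col, prev[col], row[col]))
            prev = row
    return drops


if __name__ == "__main__":
    for drop in find_drops(rows):
        print(drop)
```

On the sample data this flags the six values marked with asterisks in the question. Each row is compared only with its immediate predecessor, so a value that recovers after a drop (for example Likes 210 after 203 for movie B) is not flagged.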
