Posted to user@spark.apache.org by Mobius ReX <ao...@gmail.com> on 2016/09/01 17:47:59 UTC

What's the best way to detect and remove outliers in a table?

Given a table with hundreds of columns, a mix of categorical and
numerical attributes whose value distributions are unknown, what's the
best way to detect outliers?

For example, given a table
Category  Price
A         1
A         1.3
A         1000000
C         1

If category C above appears rarely, for example in fewer than 0.1% of
rows, then we should remove all rows with Category=C.
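
To make the categorical rule concrete, here is a minimal sketch in plain
Python rather than Spark. The function name drop_rare_categories, the
min_fraction parameter, and the sample rows are all hypothetical, chosen
only to illustrate the 0.1% cutoff described above.

```python
from collections import Counter


def drop_rare_categories(rows, column, min_fraction=0.001):
    """Keep only rows whose value in `column` accounts for at least
    `min_fraction` of all rows (0.001 corresponds to 0.1%)."""
    counts = Counter(row[column] for row in rows)
    total = len(rows)
    # Values that are frequent enough to keep.
    frequent = {value for value, n in counts.items() if n / total >= min_fraction}
    return [row for row in rows if row[column] in frequent]


# Hypothetical data: 2000 rows of category A and a single row of category C,
# so C makes up roughly 0.05% of the table.
rows = [{"Category": "A", "Price": 1.0} for _ in range(2000)]
rows.append({"Category": "C", "Price": 1.0})

kept = drop_rare_categories(rows, "Category")
```

On a Spark DataFrame the per-value counts could presumably come from a
groupBy on the column followed by count, with the rare values then
filtered out; that part is not shown here.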

Assuming Price is continuously distributed, if the Price for Category A
is rarely above 1000, then the value 1000000 above is another outlier.
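
One possible way to express the numerical rule without plotting anything
is a per-category robust score. The sketch below, again in plain Python,
uses the modified z-score based on the median absolute deviation (MAD)
with the commonly quoted cutoff of 3.5. That choice of statistic, the
function name drop_numeric_outliers, and the threshold are assumptions
for illustration, not something the question prescribes.

```python
from collections import defaultdict
from statistics import median


def drop_numeric_outliers(rows, group_col, value_col, threshold=3.5):
    """Drop rows whose value is far from its group's median, measured by
    the modified z-score 0.6745 * |x - median| / MAD."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])

    # Median and MAD per group.
    stats = {}
    for key, values in groups.items():
        med = median(values)
        mad = median(abs(v - med) for v in values)
        stats[key] = (med, mad)

    kept = []
    for row in rows:
        med, mad = stats[row[group_col]]
        if mad == 0:
            # No usable spread estimate (e.g. a single row); keep the row.
            kept.append(row)
            continue
        score = 0.6745 * abs(row[value_col] - med) / mad
        if score <= threshold:
            kept.append(row)
    return kept


# The example table from this post.
table = [
    {"Category": "A", "Price": 1},
    {"Category": "A", "Price": 1.3},
    {"Category": "A", "Price": 1000000},
    {"Category": "C", "Price": 1},
]

cleaned = drop_numeric_outliers(table, "Category", "Price")
```

With the example table this drops only the 1000000 row. Groups whose MAD
is zero are left untouched, which is a deliberate simplification; at Spark
scale the exact medians would likely need to be replaced by approximate
quantiles.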

What's the best scalable way to remove all outliers? It would be laborious
to plot the distribution curve for each numerical column and a histogram
for each categorical column.

Any tips would be greatly appreciated!

Regards
Rex