Posted to user@spark.apache.org by wilson <wi...@4shield.net> on 2022/05/01 08:14:43 UTC

How does Spark handle abnormal values?

Hello

My dataset has abnormal values in a column whose normal values are 
numeric. I can select them as:

scala> df.select("up_votes").filter($"up_votes".rlike(regex)).show()
+---------------+
|       up_votes|
+---------------+
|              <|
|              <|
|            fx-|
|             OP|
|              \|
|              v|
|             :O|
|              y|
|             :O|
|          ncurs|
|              )|
|              )|
|              X|
|             -1|
|':>?< ./ '[]\~`|
|           enc3|
|              X|
|              -|
|              X|
|              N|
+---------------+
only showing top 20 rows


Even though there are abnormal values in the column, Spark can still 
aggregate it, as you can see below.


scala> df.agg(avg("up_votes")).show()
+-----------------+
|    avg(up_votes)|
+-----------------+
|65.18445431897453|
+-----------------+

So how does Spark handle the abnormal values in a numeric column? Does 
it just ignore them?


Thank you.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: How does Spark handle abnormal values?

Posted by Artemis User <ar...@dtechspace.com>.
Your test result just gave the verdict, so #2 is the answer: Spark 
ignores those non-numeric rows completely when computing the average.
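To spell out the mechanism (in Spark's default, non-ANSI mode): avg forces an implicit cast of the string column to double, a string that fails the cast becomes NULL, and avg skips NULLs in both the sum and the count. A plain-Scala sketch of those semantics, not Spark's actual implementation, with Option standing in for NULL:

```scala
// Sketch of avg() over a string column: a failed numeric cast yields
// NULL (modelled here as None), and the average is computed only over
// the values that survived the cast.
object AvgSketch extends App {
  val raw = Seq("2", "5", "7", "xyz")   // same data as the quoted test

  // CAST(string AS DOUBLE) returns NULL on failure; Try/Option models that
  val parsed: Seq[Option[Double]] =
    raw.map(s => scala.util.Try(s.toDouble).toOption)

  val kept = parsed.flatten             // the None (NULL) drops out
  println(kept.sum / kept.size)         // (2 + 5 + 7) / 3 = 4.666666666666667
}
```

Note that with spark.sql.ansi.enabled=true the same cast would raise an error instead of silently returning NULL.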

On 5/1/22 8:20 PM, wilson wrote:
> I did a small test as follows.
>
> scala> df.printSchema()
> root
>  |-- fruit: string (nullable = true)
>  |-- number: string (nullable = true)
>
>
> scala> df.show()
> +------+------+
> | fruit|number|
> +------+------+
> | apple|     2|
> |orange|     5|
> |cherry|     7|
> |  plum|   xyz|
> +------+------+
>
>
> scala> df.agg(avg("number")).show()
> +-----------------+
> |      avg(number)|
> +-----------------+
> |4.666666666666667|
> +-----------------+
>
>
> As you can see, the "number" column is string type, and there is an 
> abnormal value in it.
>
> But in both cases Spark still handles the result pretty well. So I 
> guess:
>
> 1) Spark can auto-convert strings to numerics when aggregating.
> 2) Spark ignores abnormal values automatically when calculating the 
> relevant stuff.
>
> Am I right? Thank you.
>
> wilson
>
>
>
>
> wilson wrote:
>> My dataset has abnormal values in a column whose normal values are 
>> numeric. I can select them as:
>
>




Re: How does Spark handle abnormal values?

Posted by wilson <wi...@4shield.net>.
Thanks Mich.
But in my experience, many original data sources include abnormal 
values.
I already used rlike() with filter() to implement the data cleaning, as 
in this write-up:
https://bigcount.xyz/calculate-urban-words-vote-in-spark.html

What surprises me is that Spark converts strings to numerics 
automatically and ignores the non-numeric values. Given that, my data 
cleaning seems meaningless.
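For what it's worth, the silent cast can at least be audited: values that fail the cast are nulled without any warning, so it is easy to count them before aggregating. A plain-Scala sketch of that check, mirroring the cast-to-null behaviour with Try (in Spark itself the equivalent filter would be on col("up_votes").cast("double").isNull):

```scala
// Sketch: count how many values a numeric cast would silently null out,
// modelling Spark's cast-to-null behaviour with Try-based parsing.
object CastAudit extends App {
  // a few of the abnormal values from the thread, mixed with numbers
  val upVotes = Seq("12", "<", "fx-", "34", ":O", "56")

  val (kept, dropped) =
    upVotes.partition(s => scala.util.Try(s.toDouble).isSuccess)

  println(s"kept=${kept.size} silently-dropped=${dropped.size}")
  // kept=3 silently-dropped=3
}
```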

Thanks.

Mich Talebzadeh wrote:
> agg() and avg() are numeric functions dealing with numeric values. Why 
> is the number column defined as String type?
> 
> Do you perform data cleaning beforehand by any chance? It is good 
> practice.



Re: How does Spark handle abnormal values?

Posted by Mich Talebzadeh <mi...@gmail.com>.
agg() and avg() are numeric functions dealing with numeric values. Why 
is the number column defined as String type?

Do you perform data cleaning beforehand by any chance? It is good 
practice.

Alternatively you can use the rlike() function to filter rows that have
numeric values in a column.


scala> val data = Seq((1,"123456","123456"),
     |   (2,"3456234","ABCD12345"),(3,"48973456","ABCDEFGH"))
data: Seq[(Int, String, String)] = List((1,123456,123456),
(2,3456234,ABCD12345), (3,48973456,ABCDEFGH))

scala> val df = data.toDF("id","all_numeric","alphanumeric")
df: org.apache.spark.sql.DataFrame = [id: int, all_numeric: string ... 1
more field]

scala> df.show()
+---+-----------+------------+
| id|all_numeric|alphanumeric|
+---+-----------+------------+
|  1|     123456|      123456|
|  2|    3456234|   ABCD12345|
|  3|   48973456|    ABCDEFGH|
+---+-----------+------------+

scala> df.filter(col("alphanumeric")
     |     .rlike("^[0-9]*$")
     |   ).show()
+---+-----------+------------+
| id|all_numeric|alphanumeric|
+---+-----------+------------+
|  1|     123456|      123456|
+---+-----------+------------+
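One footnote on the pattern (my own observation, not something raised in the thread): ^[0-9]*$ also matches the empty string, since * means zero or more digits; ^[0-9]+$ requires at least one. rlike() follows Java regex semantics, which a plain String.matches() check also exposes:

```scala
// "*" = zero or more digits, so the empty string matches;
// "+" = one or more digits, so it does not.
object RegexCheck extends App {
  println("".matches("^[0-9]*$"))           // true  - empty strings slip through
  println("".matches("^[0-9]+$"))           // false
  println("123456".matches("^[0-9]+$"))     // true
  println("ABCD12345".matches("^[0-9]+$"))  // false - mixed value is rejected
}
```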


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 2 May 2022 at 01:21, wilson <wi...@4shield.net> wrote:

> I did a small test as follows.
>
> scala> df.printSchema()
> root
>   |-- fruit: string (nullable = true)
>   |-- number: string (nullable = true)
>
>
> scala> df.show()
> +------+------+
> | fruit|number|
> +------+------+
> | apple|     2|
> |orange|     5|
> |cherry|     7|
> |  plum|   xyz|
> +------+------+
>
>
> scala> df.agg(avg("number")).show()
> +-----------------+
> |      avg(number)|
> +-----------------+
> |4.666666666666667|
> +-----------------+
>
>
> As you can see, the "number" column is string type, and there is an
> abnormal value in it.
>
> But in both cases Spark still handles the result pretty well. So I
> guess:
>
> 1) Spark can auto-convert strings to numerics when aggregating.
> 2) Spark ignores abnormal values automatically when calculating the
> relevant stuff.
>
> Am I right? Thank you.
>
> wilson
>
>
>
>
> wilson wrote:
> > My dataset has abnormal values in a column whose normal values are
> > numeric. I can select them as:
>
>
>

Re: How does Spark handle abnormal values?

Posted by wilson <wi...@4shield.net>.
I did a small test as follows.

scala> df.printSchema()
root
  |-- fruit: string (nullable = true)
  |-- number: string (nullable = true)


scala> df.show()
+------+------+
| fruit|number|
+------+------+
| apple|     2|
|orange|     5|
|cherry|     7|
|  plum|   xyz|
+------+------+


scala> df.agg(avg("number")).show()
+-----------------+
|      avg(number)|
+-----------------+
|4.666666666666667|
+-----------------+


As you can see, the "number" column is string type, and there is an 
abnormal value in it.

But in both cases Spark still handles the result pretty well. So I 
guess:

1) Spark can auto-convert strings to numerics when aggregating.
2) Spark ignores abnormal values automatically when calculating the 
relevant stuff.

Am I right? Thank you.

wilson




wilson wrote:
> My dataset has abnormal values in a column whose normal values are 
> numeric. I can select them as:
