Posted to user@spark.apache.org by wilson <wi...@4shield.net> on 2022/05/01 08:14:43 UTC
How does Spark handle abnormal values?
Hello
My dataset has abnormal values in a column whose normal values are
numeric. I can select them as follows:
scala> df.select("up_votes").filter($"up_votes".rlike(regex)).show()
+---------------+
| up_votes|
+---------------+
| <|
| <|
| fx-|
| OP|
| \|
| v|
| :O|
| y|
| :O|
| ncurs|
| )|
| )|
| X|
| -1|
|':>?< ./ '[]\~`|
| enc3|
| X|
| -|
| X|
| N|
+---------------+
only showing top 20 rows
Even though there are abnormal values in the column, Spark can still
aggregate it, as you can see below.
scala> df.agg(avg("up_votes")).show()
+-----------------+
| avg(up_votes)|
+-----------------+
|65.18445431897453|
+-----------------+
So how does Spark handle the abnormal values in a numeric column? Does it
just ignore them?
Thank you.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: How does Spark handle abnormal values?
Posted by Artemis User <ar...@dtechspace.com>.
Your test result already gave the verdict, so #2 is the answer: Spark
ignores those non-numeric rows completely when aggregating the average.
(Under the hood, the string column is cast to double; values that cannot
be parsed become null, and avg skips nulls.)
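To make the mechanism concrete: in Spark's default (non-ANSI) mode, avg over a string column first casts each value to double, strings that cannot be parsed become null, and avg skips nulls. The following is a plain-Python sketch of those semantics, not Spark code:

```python
def spark_like_avg(values):
    """Mimic avg() over a string column: cast each value to double,
    turn unparseable strings into None (Spark's null), skip the Nones."""
    casted = []
    for v in values:
        try:
            casted.append(float(v))
        except ValueError:
            casted.append(None)  # CAST('xyz' AS DOUBLE) -> null in default mode
    kept = [x for x in casted if x is not None]
    return sum(kept) / len(kept) if kept else None

# wilson's test data: the "xyz" row is silently dropped
print(spark_like_avg(["2", "5", "7", "xyz"]))  # 4.666666666666667
```

Note the denominator is 3, not 4: the abnormal row is excluded from both the sum and the count.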
On 5/1/22 8:20 PM, wilson wrote:
> I did a small test as follows.
>
> scala> df.printSchema()
> root
> |-- fruit: string (nullable = true)
> |-- number: string (nullable = true)
>
>
> scala> df.show()
> +------+------+
> | fruit|number|
> +------+------+
> | apple| 2|
> |orange| 5|
> |cherry| 7|
> | plum| xyz|
> +------+------+
>
>
> scala> df.agg(avg("number")).show()
> +-----------------+
> | avg(number)|
> +-----------------+
> |4.666666666666667|
> +-----------------+
>
>
> As you see, the "number" column is string type, and there is an
> abnormal value in it.
>
> But in both cases Spark still handles the result well, so I guess:
>
> 1) Spark does some automatic translation from string to numeric when
> aggregating.
> 2) Spark ignores those abnormal values automatically when calculating
> the relevant statistics.
>
> Am I right? Thank you.
>
> wilson
>
>
>
>
> wilson wrote:
>> My dataset has abnormal values in a column whose normal values are
>> numeric. I can select them as follows:
>
Re: How does Spark handle abnormal values?
Posted by wilson <wi...@4shield.net>.
Thanks Mich.
But in my experience, many original data sources include abnormal values.
I already used rlike and filter to implement the data cleaning, as in
this write-up of mine:
https://bigcount.xyz/calculate-urban-words-vote-in-spark.html
What surprises me is that Spark does the string-to-numeric conversion
automatically and ignores the non-numeric values. Given that, my data
cleaning seems meaningless.
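One practical difference remains, though: Spark's implicit cast drops the abnormal rows silently, while an explicit cleaning step lets you see how many rows were affected. A plain-Python sketch of that difference (illustration only, not Spark code; the helper name is made up):

```python
def avg_with_report(values):
    """Average the parseable values, but also report how many rows
    were dropped -- the number that a silent avg() never shows."""
    parsed, dropped = [], 0
    for v in values:
        try:
            parsed.append(float(v))
        except ValueError:
            dropped += 1
    mean = sum(parsed) / len(parsed) if parsed else None
    return mean, dropped

mean, dropped = avg_with_report(["2", "5", "7", "xyz"])
print(mean, dropped)  # 4.666666666666667 1
```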
Thanks.
Mich Talebzadeh wrote:
> Agg and avg are numeric functions dealing with numeric values. Why
> is the number column defined as String type?
>
> Do you perform data cleaning beforehand by any chance? It is good practice.
Re: How does Spark handle abnormal values?
Posted by Mich Talebzadeh <mi...@gmail.com>.
Agg and avg are numeric functions dealing with numeric values. Why is
the number column defined as String type?
Do you perform data cleaning beforehand by any chance? It is good practice.
Alternatively you can use the rlike() function to filter rows that have
numeric values in a column.
scala> val data = Seq((1,"123456","123456"),
| (2,"3456234","ABCD12345"),(3,"48973456","ABCDEFGH"))
data: Seq[(Int, String, String)] = List((1,123456,123456),
(2,3456234,ABCD12345), (3,48973456,ABCDEFGH))
scala> val df = data.toDF("id","all_numeric","alphanumeric")
df: org.apache.spark.sql.DataFrame = [id: int, all_numeric: string ... 1
more field]
scala> df.show()
+---+-----------+------------+
| id|all_numeric|alphanumeric|
+---+-----------+------------+
| 1| 123456| 123456|
| 2| 3456234| ABCD12345|
| 3| 48973456| ABCDEFGH|
+---+-----------+------------+
scala> df.filter(col("alphanumeric")
| .rlike("^[0-9]*$")
| ).show()
+---+-----------+------------+
| id|all_numeric|alphanumeric|
+---+-----------+------------+
| 1| 123456| 123456|
+---+-----------+------------+
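One caveat about the pattern above: ^[0-9]*$ uses *, so it also matches the empty string; ^[0-9]+$ requires at least one digit. A quick plain-Python check of the two patterns (the ^ and $ anchors behave the same way as in rlike):

```python
import re

star = re.compile(r"^[0-9]*$")  # zero or more digits: also matches ""
plus = re.compile(r"^[0-9]+$")  # one or more digits: rejects ""

for s in ["123456", "ABCD12345", ""]:
    print(repr(s), bool(star.match(s)), bool(plus.match(s)))
```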
HTH
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
On Mon, 2 May 2022 at 01:21, wilson <wi...@4shield.net> wrote:
> I did a small test as follows.
>
> scala> df.printSchema()
> root
> |-- fruit: string (nullable = true)
> |-- number: string (nullable = true)
>
>
> scala> df.show()
> +------+------+
> | fruit|number|
> +------+------+
> | apple| 2|
> |orange| 5|
> |cherry| 7|
> | plum| xyz|
> +------+------+
>
>
> scala> df.agg(avg("number")).show()
> +-----------------+
> | avg(number)|
> +-----------------+
> |4.666666666666667|
> +-----------------+
>
>
> As you see, the "number" column is string type, and there is an
> abnormal value in it.
>
> But in both cases Spark still handles the result well, so I guess:
>
> 1) Spark does some automatic translation from string to numeric when
> aggregating.
> 2) Spark ignores those abnormal values automatically when calculating
> the relevant statistics.
>
> Am I right? Thank you.
>
> wilson
>
>
>
>
> wilson wrote:
> > My dataset has abnormal values in a column whose normal values are
> > numeric. I can select them as follows:
>
Re: How does Spark handle abnormal values?
Posted by wilson <wi...@4shield.net>.
I did a small test as follows.
scala> df.printSchema()
root
|-- fruit: string (nullable = true)
|-- number: string (nullable = true)
scala> df.show()
+------+------+
| fruit|number|
+------+------+
| apple| 2|
|orange| 5|
|cherry| 7|
| plum| xyz|
+------+------+
scala> df.agg(avg("number")).show()
+-----------------+
| avg(number)|
+-----------------+
|4.666666666666667|
+-----------------+
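The printed average is consistent with the "xyz" row being dropped entirely rather than treated as zero; a quick arithmetic check:

```python
# average over the three parseable rows only
print((2 + 5 + 7) / 3)  # 4.666666666666667
# if "xyz" had been counted as 0, the result would be
print((2 + 5 + 7) / 4)  # 3.5
```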
As you see, the "number" column is string type, and there is an
abnormal value in it.
But in both cases Spark still handles the result well, so I guess:
1) Spark does some automatic translation from string to numeric when
aggregating.
2) Spark ignores those abnormal values automatically when calculating
the relevant statistics.
Am I right? Thank you.
wilson
wilson wrote:
> My dataset has abnormal values in a column whose normal values are
> numeric. I can select them as follows: