Posted to user@spark.apache.org by Divya Gehlot <di...@gmail.com> on 2016/02/26 05:27:26 UTC

[Help]: DataframeNAfunction fill method throwing exception

Hi,
I have a dataset which looks like the one below:
name age
alice 35
bob null
peter 24
I need to replace the null values in a column with 0,
so I referred to the Spark API's DataFrameNaFunctions.scala
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala>

I tried the code below, but it throws an exception:
scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, LongType, DoubleType, FloatType}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, LongType, DoubleType, FloatType}

scala> val nulltestSchema = StructType(Seq(StructField("name", StringType, false), StructField("age", DoubleType, true)))
nulltestSchema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,false), StructField(age,DoubleType,true))

scala> val dfnulltest = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(nulltestSchema).load("hdfs://172.31.29.201:8020/TestDivya/Spark/nulltest.csv")
dfnulltest: org.apache.spark.sql.DataFrame = [name: string, age: double]

scala> val dfchangenull = dfnulltest.na.fill(0, Seq("age")).select("name", "age")
dfchangenull: org.apache.spark.sql.DataFrame = [name: string, age: double]

scala> dfchangenull.show
16/02/25 23:15:59 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, ip-172-31-22-135.ap-southeast-1.compute.internal):
java.text.ParseException: Unparseable number: "null"
        at java.text.NumberFormat.parse(NumberFormat.java:350)
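[Editor's note: the failure above can be reproduced without Spark at all. java.text.NumberFormat, the parser named in the stack trace, rejects the literal string "null". A minimal plain-Scala sketch:]

```scala
import java.text.NumberFormat

// NumberFormat is the parser named in the stack trace above. Feeding it
// the literal string "null" throws ParseException -- the same failure
// spark-csv hits while casting the "age" column, before na.fill or
// coalesce ever see the row.
val nf = NumberFormat.getInstance()

val parsed: Option[Double] =
  try Some(nf.parse("null").doubleValue())
  catch { case _: java.text.ParseException => None }

println(parsed) // None: "null" is not a parseable number
```

The point is that the exception is raised while reading the CSV, not by na.fill itself.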

Re: [Help]: DataframeNAfunction fill method throwing exception

Posted by ai he <he...@gmail.com>.
Hi Divya,

I guess the error is thrown from spark-csv, which tries to parse the
string "null" as a double.

The workaround is to add the nullValue option, like .option("nullValue",
"null"). However, this nullValue feature is not included in the current
spark-csv 1.3 release. Check out the master branch of spark-csv and
publish it to your local ivy repository to make it work.

Best,
Ai
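
[Editor's note: to illustrate the idea behind the nullValue option, here is a plain-Scala sketch -- a hypothetical helper, not spark-csv's actual code. With a null token configured, a matching cell becomes a real NULL instead of being fed to the number parser, and na.fill(0) can then replace it:]

```scala
// Hypothetical helper mimicking the shape of spark-csv's nullValue
// handling; not the library's real code, just the logic it implies.
def parseAge(cell: String, nullValue: String = "null"): Option[Double] =
  if (cell == nullValue) None        // matching cell becomes a real NULL
  else Some(cell.toDouble)           // anything else is parsed as a number

val rawAges = Seq("35", "null", "24")
val parsed  = rawAges.map(c => parseAge(c))

// na.fill(0, Seq("age")) then corresponds to getOrElse(0.0):
val filled = parsed.map(_.getOrElse(0.0))
println(filled) // List(35.0, 0.0, 24.0)
```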


Re: [Help]: DataframeNAfunction fill method throwing exception

Posted by Divya Gehlot <di...@gmail.com>.
Hi Jan,
Thanks for the help.
Alas, your suggestion also didn't work:

> scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, LongType, DoubleType, FloatType}
> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, LongType, DoubleType, FloatType}
> scala> val nulltestSchema = StructType(Seq(StructField("name", StringType, false), StructField("age", DoubleType, true)))
> nulltestSchema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,false), StructField(age,DoubleType,true))
> scala> val dfnulltest = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(nulltestSchema).load("hdfs://xx.xx.xx.xxx:8020/TestDivya/Spark/nulltest.csv")
> dfnulltest: org.apache.spark.sql.DataFrame = [name: string, age: double]
> scala> dfnulltest.selectExpr("name", "coalesce(age, 0) as age")
> res0: org.apache.spark.sql.DataFrame = [name: string, age: double]
> scala> val dfresult = dfnulltest.selectExpr("name", "coalesce(age, 0) as age")
> dfresult: org.apache.spark.sql.DataFrame = [name: string, age: double]
> scala> dfresult.show

 java.text.ParseException: Unparseable number: "null"
        at java.text.NumberFormat.parse(NumberFormat.java:350)


On 26 February 2016 at 15:15, Jan Štěrba <in...@jansterba.com> wrote:

> just use coalesce function
>
> df.selectExpr("name", "coalesce(age, 0) as age")

Re: [Help]: DataframeNAfunction fill method throwing exception

Posted by Jan Štěrba <in...@jansterba.com>.
Just use the coalesce function:

df.selectExpr("name", "coalesce(age, 0) as age")

--
Jan Sterba
https://twitter.com/honzasterba | http://flickr.com/honzasterba |
http://500px.com/honzasterba
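
[Editor's note: for context, coalesce returns its first non-null argument. The same idea in plain Scala, modelling a nullable column value as Option -- a sketch of the semantics, not Spark's implementation:]

```scala
// coalesce returns the first non-null value among its arguments.
def coalesce(values: Option[Double]*): Option[Double] =
  values.find(_.isDefined).flatten

// The "age" column from the example: 35, null, 24.
val ages   = Seq(Some(35.0), None, Some(24.0))
val result = ages.map(a => coalesce(a, Some(0.0)).get)
println(result) // List(35.0, 0.0, 24.0)
```

Note this only helps once the value is genuinely NULL in the DataFrame; it cannot rescue a row that already fails during CSV parsing.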
