Posted to dev@spark.apache.org by Dirceu Semighini Filho <di...@gmail.com> on 2015/09/14 22:42:07 UTC

Null Value in DecimalType column of DataFrame

Hi all,
I'm moving from Spark 1.4 to 1.5, and one of my tests is failing.
It seems that there were some changes in
org.apache.spark.sql.types.DecimalType.

This ugly code is a small sample to reproduce the error; don't use it in
your project.

test("spark test") {
  val file = context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f
=> Row.fromSeq({
    val values = f.split(",")
    Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
    values.tail.tail.tail.head)}))

  val structType = StructType(Seq(StructField("id", IntegerType, false),
    StructField("int2", IntegerType, false), StructField("double",

 DecimalType(10,10), false),


    StructField("str2", StringType, false)))

  val df = context.sqlContext.createDataFrame(file,structType)
  df.first
}

The content of the file is:

1,5,10.5,va
2,1,0.1,vb
3,8,10.0,vc

The problem resides in DecimalType: before 1.5 the scale wasn't required.
Now, using DecimalType(12,10) works fine, but with DecimalType(10,10) the
decimal value 10.5 becomes null, while 0.1 works.
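
For reference, the digits involved (plain Scala BigDecimal, no Spark
needed; the comments spell out the precision/scale arithmetic):

val v = BigDecimal("10.5")
v.precision  // 3: total significant digits
v.scale      // 1: digits to the right of the decimal point
// DecimalType(10, 10) leaves precision - scale = 0 digits for the
// integer part, so 10.5 (which needs 2) cannot be represented.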

Is there anybody working with DecimalType on 1.5.1?

Regards,
Dirceu

Re: Null Value in DecimalType column of DataFrame

Posted by Reynold Xin <rx...@databricks.com>.
+dev list

Hi Dirceu,

Whether throwing an exception or returning null is better depends on your
use case. If you are debugging and want to find bugs in your program, you
might prefer throwing an exception. However, if you are running on a large
real-world dataset (i.e., dirty data) and your query can take a while
(e.g., 30 mins), you might prefer the system to just assign null to the
dirty values that would otherwise lead to runtime exceptions, because
otherwise you could spend days just cleaning your data.

Postgres throws exceptions here, but I think that's mainly because it is
used for OLTP, where queries are short-running. Most other analytic
databases, I believe, just return null. The best we can do is provide a
config option to control the exception-handling behavior.
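
Until then, a caller who prefers fail-fast behavior can validate values
before building Rows. A minimal sketch; the checkedDecimal helper is
illustrative, not a Spark API (it assumes only DecimalType's public
precision and scale fields):

import org.apache.spark.sql.types.DecimalType

def checkedDecimal(s: String, t: DecimalType): BigDecimal = {
  val v = BigDecimal(s)
  // A DecimalType(p, s) holds at most p - s digits left of the point.
  if (v.precision - v.scale > t.precision - t.scale)
    throw new ArithmeticException(
      s"$s does not fit DecimalType(${t.precision}, ${t.scale})")
  v
}

checkedDecimal("0.1",  DecimalType(10, 10))  // ok
checkedDecimal("10.5", DecimalType(10, 10))  // throws ArithmeticException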


On Fri, Sep 18, 2015 at 8:15 AM, Dirceu Semighini Filho <
dirceu.semighini@gmail.com> wrote:

> Hi Yin, I got that part.
> I just think that instead of returning null, throwing an exception would
> be better. In the exception message we can explain that the DecimalType
> used can't fit the number being converted, due to the precision and scale
> values used to create it.
> That would make it easier for the user to find the reason why the error is
> happening, instead of getting a NullPointerException in another part of
> his code. We can also improve the documentation of the DecimalType classes
> to explain this behavior. What do you think?
>
>
>
>
> 2015-09-17 18:52 GMT-03:00 Yin Huai <yh...@databricks.com>:
>
>> As I mentioned before, the range of values of DecimalType(10, 10) is [0,
>> 1). If you have a value 10.5 and you want to cast it to DecimalType(10,
>> 10), I do not think there is any better return value than null.
>> Looks like DecimalType(10, 10) is not the right type for your use case. You
>> need a decimal type that has precision - scale >= 2.
>>
>> On Tue, Sep 15, 2015 at 6:39 AM, Dirceu Semighini Filho <
>> dirceu.semighini@gmail.com> wrote:
>>
>>>
>>> Hi Yin, I posted here because I think it's a bug.
>>> So it will return null, and I can get a NullPointerException, as I was
>>> getting. Is this really the expected behavior? I've never seen other
>>> Scala tools I've used return null like this.
>>>
>>> Regards,
>>>
>>>
>>> 2015-09-14 18:54 GMT-03:00 Yin Huai <yh...@databricks.com>:
>>>
>>>> btw, move it to user list.
>>>>
>>>> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote:
>>>>
>>>>> A scale of 10 means that there are 10 digits to the right of the
>>>>> decimal point. If you also have precision 10, the range of your data will
>>>>> be [0, 1), and casting "10.5" to DecimalType(10, 10) will return null, which
>>>>> is expected.
>>>>>
>>>>> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
>>>>> dirceu.semighini@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I'm moving from Spark 1.4 to 1.5, and one of my tests is failing.
>>>>>> It seems that there were some changes in
>>>>>> org.apache.spark.sql.types.DecimalType.
>>>>>>
>>>>>> This ugly code is a small sample to reproduce the error; don't use
>>>>>> it in your project.
>>>>>>
>>>>>> test("spark test") {
>>>>>>   val file = context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => Row.fromSeq({
>>>>>>     val values = f.split(",")
>>>>>>     Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>>>>>>     values.tail.tail.tail.head)}))
>>>>>>
>>>>>>   val structType = StructType(Seq(
>>>>>>     StructField("id", IntegerType, false),
>>>>>>     StructField("int2", IntegerType, false),
>>>>>>     StructField("double", DecimalType(10, 10), false),
>>>>>>     StructField("str2", StringType, false)))
>>>>>>
>>>>>>   val df = context.sqlContext.createDataFrame(file,structType)
>>>>>>   df.first
>>>>>> }
>>>>>>
>>>>>> The content of the file is:
>>>>>>
>>>>>> 1,5,10.5,va
>>>>>> 2,1,0.1,vb
>>>>>> 3,8,10.0,vc
>>>>>>
>>>>>> The problem resides in DecimalType: before 1.5 the scale wasn't
>>>>>> required. Now, using DecimalType(12,10) works fine, but with
>>>>>> DecimalType(10,10) the decimal value 10.5 becomes null, while 0.1 works.
>>>>>>
>>>>>> Is there anybody working with DecimalType on 1.5.1?
>>>>>>
>>>>>> Regards,
>>>>>> Dirceu
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>

Re: Null Value in DecimalType column of DataFrame

Posted by Dirceu Semighini Filho <di...@gmail.com>.
Hi Yin, I got that part.
I just think that instead of returning null, throwing an exception would be
better. In the exception message we can explain that the DecimalType used
can't fit the number being converted, due to the precision and scale values
used to create it.
That would make it easier for the user to find the reason why the error is
happening, instead of getting a NullPointerException in another part of his
code. We can also improve the documentation of the DecimalType classes to
explain this behavior. What do you think?
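
The null actually originates in the precision change. A small check with
Spark's Decimal, assuming the 1.5 API, shows the exact point where an
exception with such a message could be raised instead:

import org.apache.spark.sql.types.Decimal

// changePrecision returns false when the value does not fit the target
// precision/scale; the cast turns that false into a null.
Decimal(BigDecimal("10.5")).changePrecision(10, 10)  // false -> null
Decimal(BigDecimal("10.5")).changePrecision(12, 10)  // true  -> fits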




2015-09-17 18:52 GMT-03:00 Yin Huai <yh...@databricks.com>:

> As I mentioned before, the range of values of DecimalType(10, 10) is [0,
> 1). If you have a value 10.5 and you want to cast it to DecimalType(10,
> 10), I do not think there is any better return value than null.
> Looks like DecimalType(10, 10) is not the right type for your use case. You
> need a decimal type that has precision - scale >= 2.
>
> On Tue, Sep 15, 2015 at 6:39 AM, Dirceu Semighini Filho <
> dirceu.semighini@gmail.com> wrote:
>
>>
>> Hi Yin, I posted here because I think it's a bug.
>> So it will return null, and I can get a NullPointerException, as I was
>> getting. Is this really the expected behavior? I've never seen other
>> Scala tools I've used return null like this.
>>
>> Regards,
>>
>>
>> 2015-09-14 18:54 GMT-03:00 Yin Huai <yh...@databricks.com>:
>>
>>> btw, move it to user list.
>>>
>>> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote:
>>>
>>>> A scale of 10 means that there are 10 digits to the right of the
>>>> decimal point. If you also have precision 10, the range of your data will
>>>> be [0, 1), and casting "10.5" to DecimalType(10, 10) will return null, which
>>>> is expected.
>>>>
>>>> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
>>>> dirceu.semighini@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>> I'm moving from Spark 1.4 to 1.5, and one of my tests is failing.
>>>>> It seems that there were some changes in
>>>>> org.apache.spark.sql.types.DecimalType.
>>>>>
>>>>> This ugly code is a small sample to reproduce the error; don't use it
>>>>> in your project.
>>>>>
>>>>> test("spark test") {
>>>>>   val file = context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => Row.fromSeq({
>>>>>     val values = f.split(",")
>>>>>     Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>>>>>     values.tail.tail.tail.head)}))
>>>>>
>>>>>   val structType = StructType(Seq(
>>>>>     StructField("id", IntegerType, false),
>>>>>     StructField("int2", IntegerType, false),
>>>>>     StructField("double", DecimalType(10, 10), false),
>>>>>     StructField("str2", StringType, false)))
>>>>>
>>>>>   val df = context.sqlContext.createDataFrame(file,structType)
>>>>>   df.first
>>>>> }
>>>>>
>>>>> The content of the file is:
>>>>>
>>>>> 1,5,10.5,va
>>>>> 2,1,0.1,vb
>>>>> 3,8,10.0,vc
>>>>>
>>>>> The problem resides in DecimalType: before 1.5 the scale wasn't
>>>>> required. Now, using DecimalType(12,10) works fine, but with
>>>>> DecimalType(10,10) the decimal value 10.5 becomes null, while 0.1 works.
>>>>>
>>>>> Is there anybody working with DecimalType on 1.5.1?
>>>>>
>>>>> Regards,
>>>>> Dirceu
>>>>>
>>>>>
>>>>
>>>
>>
>>
>

Re: Null Value in DecimalType column of DataFrame

Posted by Yin Huai <yh...@databricks.com>.
As I mentioned before, the range of values of DecimalType(10, 10) is [0,
1). If you have a value 10.5 and you want to cast it to DecimalType(10,
10), I do not think there is any better return value than null.
Looks like DecimalType(10, 10) is not the right type for your use case. You
need a decimal type that has precision - scale >= 2.
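
A quick way to see this with a cast (sqlContext as elsewhere in the
thread; the output comments are what 1.5 should produce):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val df = sqlContext.createDataFrame(Seq((1, "10.5"), (2, "0.1"))).toDF("id", "v")
df.select(col("v").cast(DecimalType(10, 10))).show()
// 10.5 -> null (no room for integer digits); 0.1 -> 0.1000000000
df.select(col("v").cast(DecimalType(12, 10))).show()
// both fit: 10.5000000000 and 0.1000000000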

On Tue, Sep 15, 2015 at 6:39 AM, Dirceu Semighini Filho <
dirceu.semighini@gmail.com> wrote:

>
> Hi Yin, I posted here because I think it's a bug.
> So it will return null, and I can get a NullPointerException, as I was
> getting. Is this really the expected behavior? I've never seen other
> Scala tools I've used return null like this.
>
> Regards,
>
>
> 2015-09-14 18:54 GMT-03:00 Yin Huai <yh...@databricks.com>:
>
>> btw, move it to user list.
>>
>> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> A scale of 10 means that there are 10 digits to the right of the decimal
>>> point. If you also have precision 10, the range of your data will be [0, 1),
>>> and casting "10.5" to DecimalType(10, 10) will return null, which is
>>> expected.
>>>
>>> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
>>> dirceu.semighini@gmail.com> wrote:
>>>
>>>> Hi all,
>>>> I'm moving from Spark 1.4 to 1.5, and one of my tests is failing.
>>>> It seems that there were some changes in
>>>> org.apache.spark.sql.types.DecimalType.
>>>>
>>>> This ugly code is a small sample to reproduce the error; don't use it
>>>> in your project.
>>>>
>>>> test("spark test") {
>>>>   val file = context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => Row.fromSeq({
>>>>     val values = f.split(",")
>>>>     Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>>>>     values.tail.tail.tail.head)}))
>>>>
>>>>   val structType = StructType(Seq(
>>>>     StructField("id", IntegerType, false),
>>>>     StructField("int2", IntegerType, false),
>>>>     StructField("double", DecimalType(10, 10), false),
>>>>     StructField("str2", StringType, false)))
>>>>
>>>>   val df = context.sqlContext.createDataFrame(file,structType)
>>>>   df.first
>>>> }
>>>>
>>>> The content of the file is:
>>>>
>>>> 1,5,10.5,va
>>>> 2,1,0.1,vb
>>>> 3,8,10.0,vc
>>>>
>>>> The problem resides in DecimalType: before 1.5 the scale wasn't
>>>> required. Now, using DecimalType(12,10) works fine, but with
>>>> DecimalType(10,10) the decimal value 10.5 becomes null, while 0.1 works.
>>>>
>>>> Is there anybody working with DecimalType on 1.5.1?
>>>>
>>>> Regards,
>>>> Dirceu
>>>>
>>>>
>>>
>>
>
>

Fwd: Null Value in DecimalType column of DataFrame

Posted by Dirceu Semighini Filho <di...@gmail.com>.
Hi Yin, I posted here because I think it's a bug.
So it will return null, and I can get a NullPointerException, as I was
getting. Is this really the expected behavior? I've never seen other Scala
tools I've used return null like this.

Regards,


2015-09-14 18:54 GMT-03:00 Yin Huai <yh...@databricks.com>:

> btw, move it to user list.
>
> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> A scale of 10 means that there are 10 digits to the right of the decimal
>> point. If you also have precision 10, the range of your data will be [0, 1),
>> and casting "10.5" to DecimalType(10, 10) will return null, which is
>> expected.
>>
>> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
>> dirceu.semighini@gmail.com> wrote:
>>
>>> Hi all,
>>> I'm moving from Spark 1.4 to 1.5, and one of my tests is failing.
>>> It seems that there were some changes in
>>> org.apache.spark.sql.types.DecimalType.
>>>
>>> This ugly code is a small sample to reproduce the error; don't use it
>>> in your project.
>>>
>>> test("spark test") {
>>>   val file = context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => Row.fromSeq({
>>>     val values = f.split(",")
>>>     Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>>>     values.tail.tail.tail.head)}))
>>>
>>>   val structType = StructType(Seq(
>>>     StructField("id", IntegerType, false),
>>>     StructField("int2", IntegerType, false),
>>>     StructField("double", DecimalType(10, 10), false),
>>>     StructField("str2", StringType, false)))
>>>
>>>   val df = context.sqlContext.createDataFrame(file,structType)
>>>   df.first
>>> }
>>>
>>> The content of the file is:
>>>
>>> 1,5,10.5,va
>>> 2,1,0.1,vb
>>> 3,8,10.0,vc
>>>
>>> The problem resides in DecimalType: before 1.5 the scale wasn't
>>> required. Now, using DecimalType(12,10) works fine, but with
>>> DecimalType(10,10) the decimal value 10.5 becomes null, while 0.1 works.
>>>
>>> Is there anybody working with DecimalType on 1.5.1?
>>>
>>> Regards,
>>> Dirceu
>>>
>>>
>>
>

Re: Null Value in DecimalType column of DataFrame

Posted by Yin Huai <yh...@databricks.com>.
btw, move it to user list.

On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote:

> A scale of 10 means that there are 10 digits to the right of the decimal
> point. If you also have precision 10, the range of your data will be [0, 1),
> and casting "10.5" to DecimalType(10, 10) will return null, which is
> expected.
>
> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
> dirceu.semighini@gmail.com> wrote:
>
>> Hi all,
>> I'm moving from Spark 1.4 to 1.5, and one of my tests is failing.
>> It seems that there were some changes in
>> org.apache.spark.sql.types.DecimalType.
>>
>> This ugly code is a small sample to reproduce the error; don't use it
>> in your project.
>>
>> test("spark test") {
>>   val file = context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => Row.fromSeq({
>>     val values = f.split(",")
>>     Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>>     values.tail.tail.tail.head)}))
>>
>>   val structType = StructType(Seq(
>>     StructField("id", IntegerType, false),
>>     StructField("int2", IntegerType, false),
>>     StructField("double", DecimalType(10, 10), false),
>>     StructField("str2", StringType, false)))
>>
>>   val df = context.sqlContext.createDataFrame(file,structType)
>>   df.first
>> }
>>
>> The content of the file is:
>>
>> 1,5,10.5,va
>> 2,1,0.1,vb
>> 3,8,10.0,vc
>>
>> The problem resides in DecimalType: before 1.5 the scale wasn't required.
>> Now, using DecimalType(12,10) works fine, but with DecimalType(10,10) the
>> decimal value 10.5 becomes null, while 0.1 works.
>>
>> Is there anybody working with DecimalType on 1.5.1?
>>
>> Regards,
>> Dirceu
>>
>>
>

Re: Null Value in DecimalType column of DataFrame

Posted by Yin Huai <yh...@databricks.com>.
A scale of 10 means that there are 10 digits to the right of the decimal
point. If you also have precision 10, the range of your data will be [0, 1),
and casting "10.5" to DecimalType(10, 10) will return null, which is
expected.
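
Concretely, with precision 10 and scale 10 every significant digit sits
to the right of the decimal point, so the largest representable value is:

val max = BigDecimal("0." + "9" * 10)  // 0.9999999999
max.precision  // 10
max.scale      // 10
// Hence the [0, 1) range: 0.1 fits, while 10.5 and 10.0 do not.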

On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
dirceu.semighini@gmail.com> wrote:

> Hi all,
> I'm moving from Spark 1.4 to 1.5, and one of my tests is failing.
> It seems that there were some changes in
> org.apache.spark.sql.types.DecimalType.
>
> This ugly code is a small sample to reproduce the error; don't use it
> in your project.
>
> test("spark test") {
>   val file = context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => Row.fromSeq({
>     val values = f.split(",")
>     Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>     values.tail.tail.tail.head)}))
>
>   val structType = StructType(Seq(
>     StructField("id", IntegerType, false),
>     StructField("int2", IntegerType, false),
>     StructField("double", DecimalType(10, 10), false),
>     StructField("str2", StringType, false)))
>
>   val df = context.sqlContext.createDataFrame(file,structType)
>   df.first
> }
>
> The content of the file is:
>
> 1,5,10.5,va
> 2,1,0.1,vb
> 3,8,10.0,vc
>
> The problem resides in DecimalType: before 1.5 the scale wasn't required.
> Now, using DecimalType(12,10) works fine, but with DecimalType(10,10) the
> decimal value 10.5 becomes null, while 0.1 works.
>
> Is there anybody working with DecimalType on 1.5.1?
>
> Regards,
> Dirceu
>
>