Posted to user@spark.apache.org by Mich Talebzadeh <mi...@peridale.co.uk> on 2016/02/20 09:24:01 UTC

Checking for null values when mapping

 

I have a DF like below reading a csv file

 

 

val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")

 

val a = df.map(x => (x.getString(0), x.getString(1),
  x.getString(2).substring(1).replace(",", "").toDouble,
  x.getString(3).substring(1).replace(",", "").toDouble,
  x.getString(4).substring(1).replace(",", "").toDouble))

 

 

For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below

 

[421,02/10/2015,?1,187.50,?237.50,?1,425.00]

[,,,,]

[Net income,,?182,531.25,?14,606.25,?197,137.50]

[,,,,]

[year 2014,,?113,500.00,?0.00,?113,500.00]

[Year 2015,,?69,031.25,?14,606.25,?83,637.50]

 

However, I get 

 

a.collect.foreach(println)

16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

 

I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error

 

 

The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?
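
A minimal sketch of such a guard (an illustration on my part, not tested against this data; it assumes the five-column layout above, parseMoney is a hypothetical helper, and empty cells default to 0.0):

// substring(1) on an empty cell is what throws StringIndexOutOfBoundsException,
// so test for null/empty first
def parseMoney(s: String): Double =
  if (s == null || s.isEmpty) 0.0
  else s.substring(1).replace(",", "").toDouble

val a = df.map(x => (x.getString(0), x.getString(1),
  parseMoney(x.getString(2)), parseMoney(x.getString(3)), parseMoney(x.getString(4))))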

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

 


Re: Checking for null values when mapping

Posted by Chandeep Singh <cs...@chandeep.com>.
Ah. Ok.

> On Feb 20, 2016, at 2:31 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
> 
> Yes, I did that as well but no joy. My shell does it for Windows files automatically.
>  
> Thanks, 
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
>  
> From: Chandeep Singh [mailto:cs@chandeep.com] 
> Sent: 20 February 2016 14:27
> To: Mich Talebzadeh <mi...@peridale.co.uk>
> Cc: user @spark <us...@spark.apache.org>
> Subject: Re: Checking for null values when mapping
>  
> Also, have you looked into Dos2Unix (http://dos2unix.sourceforge.net/)?
>  
> Has helped me in the past to deal with special characters while using Windows-based CSVs in Linux. (Might not be the solution here; just an FYI :))
>  
>> On Feb 20, 2016, at 2:17 PM, Chandeep Singh <cs@chandeep.com <ma...@chandeep.com>> wrote:
>>  
>> Understood. In that case Ted’s suggestion to check the length should solve the problem.
>>  
>>> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk>> wrote:
>>>  
>>> Hi,
>>>  
>>> That is a good question.
>>>  
>>> When data is exported from CSV to Linux, any character that cannot be transformed is replaced by ?. That question mark is not actually the expected “?” :)
>>>  
>>> So the only way I can get rid of it is by dropping the first character using substring(1). I checked; I did the same in Hive SQL.
>>>  
>>> The actual field in the CSV is “£2,500.00”, which translates into “?2,500.00”
>>>  
>>> HTH
>>>  
>>>  
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>>  
>>> From: Chandeep Singh [mailto:cs@chandeep.com <ma...@chandeep.com>] 
>>> Sent: 20 February 2016 13:47
>>> To: Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk>>
>>> Cc: user @spark <user@spark.apache.org <ma...@spark.apache.org>>
>>> Subject: Re: Checking for null values when mapping
>>>  
>>> Looks like you’re using substring just to get rid of the ‘?’. Why not use replace for that as well? Then you wouldn’t run into index-out-of-bounds issues.
>>>  
>>> val a = "?1,187.50"
>>> val b = ""
>>>  
>>> println(a.substring(1).replace(",", ""))
>>> —> 1187.50
>>>  
>>> println(a.replace("?", "").replace(",", ""))
>>> —> 1187.50
>>>  
>>> println(b.replace("?", "").replace(",", ""))
>>> —> No error / output since neither "?" nor "," exists.
>>>  
>>>  
>>>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk>> wrote:
>>>>  
>>>>  
>>>> I have a DF like below reading a csv file
>>>>  
>>>>  
>>>> val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>>>>  
>>>> val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
>>>>  
>>>>  
>>>> For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below
>>>>  
>>>> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>>>> [,,,,]
>>>> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>>>> [,,,,]
>>>> [year 2014,,?113,500.00,?0.00,?113,500.00]
>>>> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>>>>  
>>>> However, I get 
>>>>  
>>>> a.collect.foreach(println)
>>>> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)
>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>>  
>>>> I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error
>>>>  
>>>>  
>>>> The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?
>>>>  
>>>> Thanks
>>>>  
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>  
>>>> http://talebzadehmich.wordpress.com


RE: Checking for null values when mapping

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Yes, I did that as well but no joy. My shell does it for Windows files automatically.

 

Thanks, 

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

 

From: Chandeep Singh [mailto:cs@chandeep.com] 
Sent: 20 February 2016 14:27
To: Mich Talebzadeh <mi...@peridale.co.uk>
Cc: user @spark <us...@spark.apache.org>
Subject: Re: Checking for null values when mapping

 

Also, have you looked into Dos2Unix (http://dos2unix.sourceforge.net/)?

 

Has helped me in the past to deal with special characters while using Windows-based CSVs in Linux. (Might not be the solution here; just an FYI :))

 

On Feb 20, 2016, at 2:17 PM, Chandeep Singh <cs@chandeep.com <ma...@chandeep.com> > wrote:

 

Understood. In that case Ted’s suggestion to check the length should solve the problem.

 

On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:

 

Hi,

 

That is a good question.

 

When data is exported from CSV to Linux, any character that cannot be transformed is replaced by ?. That question mark is not actually the expected “?” :)

 

So the only way I can get rid of it is by dropping the first character using substring(1). I checked; I did the same in Hive SQL.

 

The actual field in the CSV is “£2,500.00”, which translates into “?2,500.00”

 

HTH

 

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

 

From: Chandeep Singh [mailto:cs@chandeep.com] 
Sent: 20 February 2016 13:47
To: Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> >
Cc: user @spark <user@spark.apache.org <ma...@spark.apache.org> >
Subject: Re: Checking for null values when mapping

 

Looks like you’re using substring just to get rid of the ‘?’. Why not use replace for that as well? Then you wouldn’t run into index-out-of-bounds issues.

 

val a = "?1,187.50"

val b = ""

 

println(a.substring(1).replace(",", ""))

—> 1187.50

 

println(a.replace("?", "").replace(",", ""))

—> 1187.50

 

println(b.replace("?", "").replace(",", ""))

—> No error / output since neither "?" nor "," exists.

 

 

On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh < <ma...@peridale.co.uk> mich@peridale.co.uk> wrote:

 

 

I have a DF like below reading a csv file

 

 

val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")

 

val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))

 

 

For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below

 

[421,02/10/2015,?1,187.50,?237.50,?1,425.00]

[,,,,]

[Net income,,?182,531.25,?14,606.25,?197,137.50]

[,,,,]

[year 2014,,?113,500.00,?0.00,?113,500.00]

[Year 2015,,?69,031.25,?14,606.25,?83,637.50]

 

However, I get 

 

a.collect.foreach(println)

16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

 

I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error

 

 

The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

 


Re: Checking for null values when mapping

Posted by Chandeep Singh <cs...@chandeep.com>.
Also, have you looked into Dos2Unix (http://dos2unix.sourceforge.net/)?

Has helped me in the past to deal with special characters while using Windows-based CSVs in Linux. (Might not be the solution here; just an FYI :))

> On Feb 20, 2016, at 2:17 PM, Chandeep Singh <cs...@chandeep.com> wrote:
> 
> Understood. In that case Ted’s suggestion to check the length should solve the problem.
> 
>> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk>> wrote:
>> 
>> Hi,
>>  
>> That is a good question.
>>  
>> When data is exported from CSV to Linux, any character that cannot be transformed is replaced by ?. That question mark is not actually the expected “?” :)
>>  
>> So the only way I can get rid of it is by dropping the first character using substring(1). I checked; I did the same in Hive SQL.
>>  
>> The actual field in the CSV is “£2,500.00”, which translates into “?2,500.00”
>>  
>> HTH
>>  
>>  
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>>  
>> From: Chandeep Singh [mailto:cs@chandeep.com <ma...@chandeep.com>] 
>> Sent: 20 February 2016 13:47
>> To: Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk>>
>> Cc: user @spark <user@spark.apache.org <ma...@spark.apache.org>>
>> Subject: Re: Checking for null values when mapping
>>  
>> Looks like you’re using substring just to get rid of the ‘?’. Why not use replace for that as well? Then you wouldn’t run into index-out-of-bounds issues.
>>  
>> val a = "?1,187.50"
>> val b = ""
>>  
>> println(a.substring(1).replace(",", ""))
>> —> 1187.50
>>  
>> println(a.replace("?", "").replace(",", ""))
>> —> 1187.50
>>  
>> println(b.replace("?", "").replace(",", ""))
>> —> No error / output since neither "?" nor "," exists.
>>  
>>  
>>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk>> wrote:
>>>  
>>>  
>>> I have a DF like below reading a csv file
>>>  
>>>  
>>> val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>>>  
>>> val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
>>>  
>>>  
>>> For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below
>>>  
>>> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>>> [,,,,]
>>> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>>> [,,,,]
>>> [year 2014,,?113,500.00,?0.00,?113,500.00]
>>> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>>>  
>>> However, I get 
>>>  
>>> a.collect.foreach(println)
>>> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)
>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>  
>>> I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error
>>>  
>>>  
>>> The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?
>>>  
>>> Thanks
>>>  
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
> 


Re: Checking for null values when mapping

Posted by Chandeep Singh <cs...@chandeep.com>.
Understood. In that case Ted’s suggestion to check the length should solve the problem.

> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
> 
> Hi,
>  
> That is a good question.
>  
> When data is exported from CSV to Linux, any character that cannot be transformed is replaced by ?. That question mark is not actually the expected “?” :)
>  
> So the only way I can get rid of it is by dropping the first character using substring(1). I checked; I did the same in Hive SQL.
>  
> The actual field in the CSV is “£2,500.00”, which translates into “?2,500.00”
>  
> HTH
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
>  
> From: Chandeep Singh [mailto:cs@chandeep.com] 
> Sent: 20 February 2016 13:47
> To: Mich Talebzadeh <mi...@peridale.co.uk>
> Cc: user @spark <us...@spark.apache.org>
> Subject: Re: Checking for null values when mapping
>  
> Looks like you’re using substring just to get rid of the ‘?’. Why not use replace for that as well? Then you wouldn’t run into index-out-of-bounds issues.
>  
> val a = "?1,187.50"
> val b = ""
>  
> println(a.substring(1).replace(",", ""))
> —> 1187.50
>  
> println(a.replace("?", "").replace(",", ""))
> —> 1187.50
>  
> println(b.replace("?", "").replace(",", ""))
> —> No error / output since neither "?" nor "," exists.
>  
>  
>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk>> wrote:
>>  
>>  
>> I have a DF like below reading a csv file
>>  
>>  
>> val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>>  
>> val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
>>  
>>  
>> For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below
>>  
>> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>> [,,,,]
>> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>> [,,,,]
>> [year 2014,,?113,500.00,?0.00,?113,500.00]
>> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>>  
>> However, I get 
>>  
>> a.collect.foreach(println)
>> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)
>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>  
>> I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error
>>  
>>  
>> The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?
>>  
>> Thanks
>>  
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com


RE: Checking for null values when mapping

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi,

 

That is a good question.

 

When data is exported from CSV to Linux, any character that cannot be transformed is replaced by ?. That question mark is not actually the expected “?” :) 

 

So the only way I can get rid of it is by dropping the first character using substring(1). I checked; I did the same in Hive SQL.

 

The actual field in the CSV is “£2,500.00”, which translates into “?2,500.00”

 

HTH
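
An alternative sketch (my own illustration, untested here): strip everything that is not a digit or a decimal point, which removes the mangled character and the thousands separators in one go, whatever the “£” turned into:

// "?2,500.00" -> "2500.00"; cleanNumber is a hypothetical helper
def cleanNumber(s: String): String = s.replaceAll("[^0-9.]", "")

Note that cleanNumber("") is just "", so empty cells still need filtering or a default before toDouble.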

 

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

 

From: Chandeep Singh [mailto:cs@chandeep.com] 
Sent: 20 February 2016 13:47
To: Mich Talebzadeh <mi...@peridale.co.uk>
Cc: user @spark <us...@spark.apache.org>
Subject: Re: Checking for null values when mapping

 

Looks like you’re using substring just to get rid of the ‘?’. Why not use replace for that as well? Then you wouldn’t run into index-out-of-bounds issues.

 

val a = "?1,187.50"

val b = ""

 

println(a.substring(1).replace(",", ""))

—> 1187.50

 

println(a.replace("?", "").replace(",", ""))

—> 1187.50

 

println(b.replace("?", "").replace(",", ""))

—> No error / output since neither "?" nor "," exists.

 

 

On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:

 

 

I have a DF like below reading a csv file

 

 

val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")

 

val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))

 

 

For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below

 

[421,02/10/2015,?1,187.50,?237.50,?1,425.00]

[,,,,]

[Net income,,?182,531.25,?14,606.25,?197,137.50]

[,,,,]

[year 2014,,?113,500.00,?0.00,?113,500.00]

[Year 2015,,?69,031.25,?14,606.25,?83,637.50]

 

However, I get 

 

a.collect.foreach(println)

16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

 

I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error

 

 

The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 


Re: Checking for null values when mapping

Posted by Chandeep Singh <cs...@chandeep.com>.
Looks like you’re using substring just to get rid of the ‘?’. Why not use replace for that as well? Then you wouldn’t run into index-out-of-bounds issues.

val a = "?1,187.50"
val b = ""

println(a.substring(1).replace(",", ""))
—> 1187.50

println(a.replace("?", "").replace(",", ""))
—> 1187.50

println(b.replace("?", "").replace(",", ""))
—> No error / output since neither "?" nor "," exists.
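
Applied to the original map, the replace-based version might look like this (a sketch; note that toDouble will still fail on the fully empty [,,,,] rows, so those still need filtering out first):

val a = df.map(x => (x.getString(0), x.getString(1),
  x.getString(2).replace("?", "").replace(",", "").toDouble,
  x.getString(3).replace("?", "").replace(",", "").toDouble,
  x.getString(4).replace("?", "").replace(",", "").toDouble))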


> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
> 
>  
> I have a DF like below reading a csv file
>  
>  
> val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>  
> val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
>  
>  
> For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below
>  
> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
> [,,,,]
> [Net income,,?182,531.25,?14,606.25,?197,137.50]
> [,,,,]
> [year 2014,,?113,500.00,?0.00,?113,500.00]
> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>  
> However, I get 
>  
> a.collect.foreach(println)
> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>  
> I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error
>  
>  
> The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?
>  
> Thanks
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com


RE: Checking for null values when mapping

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Ok, it turned out a bit less complicated than I thought :). I would be interested to know: is creating a temporary table from the DF and pushing the data into Hive the best option here?

 

1.    Prepare and clean up data with filter & map

2.    Convert the RDD to DF

3.    Create temporary table from DF

4.    Use Hive database

5.    Drop if exists and create ORC table in Hive database

6.    Simple Insert/select from temporary table to Hive table

 

//

// Get a DF first based on Databricks CSV libraries

//

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")

//

// Next filter out empty rows (last column has to be > "") and get rid of the "?" special character. Also get rid of "," in money fields

// Example csv cell £2,500.00 --> needs to be transformed to plain 2500.00

//

val a = df.filter(col("Total") > "").map(x => (x.getString(0),x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble, x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))

 

a.first

//

// convert this RDD to DF and create a Spark temporary table

//

a.toDF.registerTempTable("tmp")

//

// Need to create and populate target ORC table t3 in database test in Hive

//

sql("use test")

//

// Drop and create table t3

//

sql("DROP TABLE IF EXISTS t3")

var sqltext : String = ""

sqltext = """

CREATE TABLE t3 (

INVOICENUMBER          INT

,PAYMENTDATE            timestamp

,NET                    DECIMAL(20,2)

,VAT                    DECIMAL(20,2)

,TOTAL                  DECIMAL(20,2)

)

COMMENT 'from csv file from excel sheet'

STORED AS ORC

TBLPROPERTIES ( "orc.compress"="ZLIB" )

"""

sql(sqltext)

//

// Put data in Hive table. Clean up is already done

//

sqltext = "INSERT INTO t3 SELECT * FROM tmp"

sql(sqltext)

sql("SELECT * FROM t3 ORDER BY 1").collect.foreach(println)

sys.exit()

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

 

From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: 20 February 2016 12:33
To: Mich Talebzadeh <mi...@peridale.co.uk>
Cc: Michał Zieliński <zi...@gmail.com>; user @spark <us...@spark.apache.org>
Subject: Re: Checking for null values when mapping

 

For #2, you can filter out rows whose first column has length 0. 

 

Cheers


On Feb 20, 2016, at 6:59 AM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:

Thanks

 

 

So what I did was

 

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")

df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]

 

 

scala> df.printSchema

root

|-- Invoice Number: string (nullable = true)

|-- Payment date: string (nullable = true)

|-- Net: string (nullable = true)

|-- VAT: string (nullable = true)

|-- Total: string (nullable = true)

 

 

So all the columns are Strings

 

Then I tried to exclude null rows by filtering on all columns not being null and mapping the rest

 

scala> val a = df.filter(col("Invoice Number").isNotNull and col("Payment date").isNotNull and col("Net").isNotNull and col("VAT").isNotNull and col("Total").isNotNull).map(x => (x.getString(1),x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))

 

a: org.apache.spark.rdd.RDD[(String, Double, Double, Double)] = MapPartitionsRDD[176] at map at <console>:21

 

This still comes up with the “String index out of range” error

 

16/02/20 11:50:51 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 18)

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

 

My questions are:

 

1.    Doing the map, map(x => (x.getString(1) -- can I replace x.getString(1) with the actual column name, say “Invoice Number”, and so forth for the other columns as well?

2.    Sounds like it crashes because of these rows below at the end

[421,02/10/2015,?1,187.50,?237.50,?1,425.00]  \\ example good one

[,,,,] \\ bad one, empty one

[Net income,,?182,531.25,?14,606.25,?197,137.50] 

[,,,,] \\ bad one, empty one

[year 2014,,?113,500.00,?0.00,?113,500.00]

[Year 2015,,?69,031.25,?14,606.25,?83,637.50]

 

3.    Also, to clarify: I want to drop those two empty lines -> [,,,,] if I can. Unfortunately the drop() call does not work

a.drop()

<console>:24: error: value drop is not a member of org.apache.spark.rdd.RDD[(String, Double, Double, Double)]

              a.drop()

                ^

 

Thanks again,

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

 

 


Re: Checking for null values when mapping

Posted by Ted Yu <yu...@gmail.com>.
For #2, you can filter out rows whose first column has length 0. 

Cheers
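
In DataFrame terms, that check might look like this (a sketch; the length function lives in org.apache.spark.sql.functions as of Spark 1.5, and the column name comes from the schema printed earlier in the thread):

import org.apache.spark.sql.functions.{col, length}

// keep only rows whose first column is non-empty
val nonEmpty = df.filter(length(col("Invoice Number")) > 0)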

> On Feb 20, 2016, at 6:59 AM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
> 
> Thanks
>  
>  
> So what I did was
>  
> scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
> df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]
>  
>  
> scala> df.printSchema
> root
> |-- Invoice Number: string (nullable = true)
> |-- Payment date: string (nullable = true)
> |-- Net: string (nullable = true)
> |-- VAT: string (nullable = true)
> |-- Total: string (nullable = true)
>  
>  
> So all the columns are Strings
>  
> Then I tried to exclude null rows by filtering on all columns not being null and mapping the rest
>  
> scala> val a = df.filter(col("Invoice Number").isNotNull and col("Payment date").isNotNull and col("Net").isNotNull and col("VAT").isNotNull and col("Total").isNotNull).map(x => (x.getString(1),x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
>  
> a: org.apache.spark.rdd.RDD[(String, Double, Double, Double)] = MapPartitionsRDD[176] at map at <console>:21
>  
> This still comes up with the “String index out of range” error
>  
> 16/02/20 11:50:51 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 18)
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>  
> My questions are:
>  
> 1.    Doing the map, map(x => (x.getString(1) -- can I replace x.getString(1) with the actual column name, say “Invoice Number”, and so forth for the other columns as well?
> 2.    Sounds like it crashes because of these rows below at the end
> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]  \\ example good one
> [,,,,] \\ bad one, empty one
> [Net income,,?182,531.25,?14,606.25,?197,137.50]
> [,,,,] \\ bad one, empty one
> [year 2014,,?113,500.00,?0.00,?113,500.00]
> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>  
> 3.    Also, to clarify: I want to drop those two empty lines -> [,,,,] if I can. Unfortunately the drop() call does not work
> a.drop()
> <console>:24: error: value drop is not a member of org.apache.spark.rdd.RDD[(String, Double, Double, Double)]
>               a.drop()
>                 ^
>  
> Thanks again,
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
>  
>  
> From: Michał Zieliński [mailto:zielinski.michal0@gmail.com] 
> Sent: 20 February 2016 08:59
> To: Mich Talebzadeh <mi...@peridale.co.uk>
> Cc: user @spark <us...@spark.apache.org>
> Subject: Re: Checking for null values when mapping
>  
> You can use filter and isNotNull on Column before the map.
>  
> On 20 February 2016 at 08:24, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>  
> I have a DF like below reading a csv file
>  
>  
> val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>  
> val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
>  
>  
> For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below
>  
> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
> [,,,,]
> [Net income,,?182,531.25,?14,606.25,?197,137.50]
> [,,,,]
> [year 2014,,?113,500.00,?0.00,?113,500.00]
> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>  
> However, I get
>  
> a.collect.foreach(println)
> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>  
> I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error
>  
>  
> The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?
>  
> Thanks
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
>  
>  
>  

RE: Checking for null values when mapping

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks

 

 

So what I did was

 

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")

df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]

 

 

scala> df.printSchema

root

|-- Invoice Number: string (nullable = true)

|-- Payment date: string (nullable = true)

|-- Net: string (nullable = true)

|-- VAT: string (nullable = true)

|-- Total: string (nullable = true)

 

 

So all the columns are Strings

 

Then I tried to exclude null rows by filtering on all columns not being null and mapping the rest

 

scala> val a = df.filter(col("Invoice Number").isNotNull and col("Payment date").isNotNull and col("Net").isNotNull and col("VAT").isNotNull and col("Total").isNotNull).map(x => (x.getString(1),x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))

 

a: org.apache.spark.rdd.RDD[(String, Double, Double, Double)] = MapPartitionsRDD[176] at map at <console>:21

 

This still comes up with the “String index out of range” error

 

16/02/20 11:50:51 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 18)

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

 

My questions are:

 

1.    Doing the map, map(x => (x.getString(1) -- can I replace x.getString(1) with the actual column name, say “Invoice Number”, and so forth for the other columns as well?

2.    Sounds like it crashes because of these rows below at the end

[421,02/10/2015,?1,187.50,?237.50,?1,425.00]  \\ example good one

[,,,,] \\ bad one, empty one

[Net income,,?182,531.25,?14,606.25,?197,137.50] 

[,,,,] \\ bad one, empty one

[year 2014,,?113,500.00,?0.00,?113,500.00]

[Year 2015,,?69,031.25,?14,606.25,?83,637.50]

 

3.    Also, to clarify: I want to drop those two empty lines -> [,,,,] if I can. Unfortunately the drop() call does not work

a.drop()

<console>:24: error: value drop is not a member of org.apache.spark.rdd.RDD[(String, Double, Double, Double)]

              a.drop()

                ^
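
For reference, a sketch of both points (my illustration, assuming the schema printed above): Row supports name-based access via getAs, and the empty [,,,,] rows can be dropped with a DataFrame filter before the map, since drop() is not an RDD method:

import org.apache.spark.sql.functions.col

// question 1: name-based access instead of positional getString(n)
val inv = df.map(x => x.getAs[String]("Invoice Number"))

// question 3: keep only rows with a non-empty first column before mapping
val noEmpties = df.filter(col("Invoice Number") > "")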

 

Thanks again,

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

 

From: Michał Zieliński [mailto:zielinski.michal0@gmail.com] 
Sent: 20 February 2016 08:59
To: Mich Talebzadeh <mi...@peridale.co.uk>
Cc: user @spark <us...@spark.apache.org>
Subject: Re: Checking for null values when mapping

 

You can use filter and isNotNull on Column (https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Column) before the map.

 

On 20 February 2016 at 08:24, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:

 

I have a DF like below reading a csv file

 

 

val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")

 

val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))

 

 

For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below

 

[421,02/10/2015,?1,187.50,?237.50,?1,425.00]

[,,,,]

[Net income,,?182,531.25,?14,606.25,?197,137.50]

[,,,,]

[year 2014,,?113,500.00,?0.00,?113,500.00]

[Year 2015,,?69,031.25,?14,606.25,?83,637.50]

 

However, I get 

 

a.collect.foreach(println)

16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

 

I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error

 

 

The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

 

 


Re: Checking for null values when mapping

Posted by Michał Zieliński <zi...@gmail.com>.
You can use filter and isNotNull on Column (https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Column) before the map.
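
For example (a sketch using the column names that appear elsewhere in the thread; note that isNotNull catches SQL nulls, while the problem cells here turn out to be empty strings, which still pass):

import org.apache.spark.sql.functions.col

val filtered = df.filter(col("Net").isNotNull && col("VAT").isNotNull && col("Total").isNotNull)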

On 20 February 2016 at 08:24, Mich Talebzadeh <mi...@peridale.co.uk> wrote:

>
>
> I have a DF like below reading a csv file
>
>
>
>
>
> val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>
>
>
> val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble, x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
>
>
>
>
>
> For most rows I am reading from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below
>
>
>
> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>
> [,,,,]
>
> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>
> [,,,,]
>
> [year 2014,,?113,500.00,?0.00,?113,500.00]
>
> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>
>
>
> However, I get
>
>
>
> a.collect.foreach(println)
>
> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)
>
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>
>
>
> I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error
>
>
>
>
>
> The easiest solution seems to be to check whether the value is not null before doing the substring operation. Can this be done without using a UDF?
>
>
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com