Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/09/27 20:49:16 UTC

Issue with rogue data in csv file used in Spark application

I have historical prices for various stocks.

Each csv file has 10 years of trades, one row per day.

These are the columns defined in the class

case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
Float, High: Float, Low: Float, Close: Float, Volume: Integer)

The issue is with the Open, High, Low and Close columns, which are all
defined as Float.

Most rows are OK, like the first one below, but the second one, with "-"
where a Float is expected, causes issues

  Date     Open High  Low   Close Volume
27-Sep-16 80.91 80.93 79.87 80.85 1873158
23-Dec-11   -     -    -    40.56 0

Because the prices are defined as Float, these rows cause the application
to crash:
scala> val rs = df2.filter(changeToDate("TradeDate") >=
monthsago).select((changeToDate("TradeDate").as("TradeDate")),(('Close+'Open)/2).as("AverageDailyPrice"),
'Low.as("Day's Low"), 'High.as("Day's High")).orderBy("TradeDate").collect
16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID
260)
java.lang.NumberFormatException: For input string: "-"


One way is to define the prices as Strings, but that is not
meaningful. Alternatively, the clean-up could be done before putting the csv
files in HDFS, but that becomes tedious and error prone.

Any ideas will be appreciated.


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Issue with rogue data in csv file used in Spark application

Posted by Hyukjin Kwon <gu...@gmail.com>.
Hi Mich,

I guess you could use nullValue option by setting it to null.

If you are reading them in as strings in the first place, then you would
first hit https://github.com/apache/spark/pull/14118, which is resolved
from 2.0.1.

Unfortunately, this bug also exists in the external csv library for strings,
if I recall correctly.

However, it'd be fine if you set the schema explicitly when you load, as
this bug does not exist for floats at least.

I hope this is helpful.

Thanks!
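
A minimal sketch of that combination, assuming Spark 2.x and that "-" is the
only rogue placeholder (column names as used in this thread):

import org.apache.spark.sql.types._

// Treat "-" as null and supply the schema explicitly, so the price columns
// come back as null Floats instead of failing to parse.
val schema = StructType(Seq(
  StructField("Stock", StringType),
  StructField("Ticker", StringType),
  StructField("TradeDate", StringType),
  StructField("Open", FloatType),
  StructField("High", FloatType),
  StructField("Low", FloatType),
  StructField("Close", FloatType),
  StructField("Volume", IntegerType)
))

val df1 = spark.read
  .option("header", "true")
  .option("nullValue", "-")   // map the "-" placeholder to null
  .schema(schema)
  .csv(location)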

On 28 Sep 2016 7:06 a.m., "Mich Talebzadeh" <mi...@gmail.com>
wrote:

> Thanks guys
>
> Actually these are the 7 rogue rows. The column 0 is the Volume column
> which means there was no trades on those days
>
>
> cat stock.csv|grep ",0"
> SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>
> So one way would be to exclude the rows that there was no volume of trade
> that day when cleaning up the csv file
>
> *cat stock.csv|grep -v **",0"*
>
> and that works. Bearing in mind that putting 0s in place of "-" will skew
> the price plot.
>
> BTW I am using Spark csv as well
>
> val df1 = spark.read.option("header", true).csv(location)
>
> This is the class and the mapping
>
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>
>
> In here I have
>
> p(3).toString.toFloat
>
> How can one check for rogue data in p(3)?
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 September 2016 at 21:49, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>>
>> I have historical prices for various stocks.
>>
>> Each csv file has 10 years trade one row per each day.
>>
>> These are the columns defined in the class
>>
>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>
>> The issue is with Open, High, Low, Close columns that all are defined as
>> Float.
>>
>> Most rows are OK like below but the red one with "-" defined as Float
>> causes issues
>>
>>   Date     Open High  Low   Close Volume
>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>> 23-Dec-11   -     -    -    40.56 0
>>
>> Because the prices are defined as Float, these rows cause the application
>> to crash
>> scala> val rs = df2.filter(changeToDate("TradeDate") >=
>> monthsago).select((changeToDate("TradeDate").as("TradeDate")
>> ),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's Low"),
>> 'High.as("Day's High")).orderBy("TradeDate").collect
>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0
>> (TID 260)
>> java.lang.NumberFormatException: For input string: "-"
>>
>>
>> One way is to define the prices as Strings but that is not
>> meaningful. Alternatively do the clean up before putting csv in HDFS but
>> that becomes tedious and error prone.
>>
>> Any ideas will be appreciated.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>

Re: Issue with rogue data in csv file used in Spark application

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks guys.

This seemed to be working after declaring all columns as Strings to start
with and using the filters below to avoid rogue characters. The second filter
ensures that there were trade volumes on that date.

val rs = df2.filter($"Open" !== "-")
            .filter($"Volume".cast("Integer") > 0)
            .filter(changeToDate("TradeDate") >= monthsago)
            .select(changeToDate("TradeDate").as("TradeDate"),
                    round(($"Close".cast("Float") + $"Open".cast("Float")) / 2, 2).as("AverageDailyPrice"),
                    $"Low".cast("Float").as("Day's Low"),
                    $"High".cast("Float").as("Day's High"))
            .orderBy(changeToDate("TradeDate"))

Cheers


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 September 2016 at 14:45, Bedrytski Aliaksandr <sp...@bedryt.ski>
wrote:

> Hi Mich,
>
> if I understood you well, you may cast the value to float, it will yield
> null if the value is not a correct float:
>
> val df = Seq(("-", 5), ("1", 6), (",", 7), ("8.6", 7)).toDF("value",
> "id").createOrReplaceTempView("lines")
>
> spark.sql("SELECT cast(value as FLOAT) from lines").show()
>
> +-----+
> |value|
> +-----+
> | null|
> | 1.0 |
> | null|
> | 8.6 |
> +-----+
>
> After it you may filter the DataFrame for values containing null.
>
> Regards,
> --
>   Bedrytski Aliaksandr
>   spark@bedryt.ski
>
>
>
> On Wed, Sep 28, 2016, at 10:11, Mich Talebzadeh wrote:
>
> Thanks all.
>
> This is the csv schema all columns mapped to String
>
> scala> df2.printSchema
> root
>  |-- Stock: string (nullable = true)
>  |-- Ticker: string (nullable = true)
>  |-- TradeDate: string (nullable = true)
>  |-- Open: string (nullable = true)
>  |-- High: string (nullable = true)
>  |-- Low: string (nullable = true)
>  |-- Close: string (nullable = true)
>  |-- Volume: string (nullable = true)
>
> The issue I have can be shown as below
>
> df2.filter( $"OPen" === "-").select((changeToDate("TradeDate").as("TradeDate")),
> 'Open, 'High, 'Low, 'Close, 'Volume).show
>
> +----------+----+----+---+-----+------+
> | TradeDate|Open|High|Low|Close|Volume|
> +----------+----+----+---+-----+------+
> |2011-12-23|   -|   -|  -|40.56|     0|
> |2011-04-21|   -|   -|  -|45.85|     0|
> |2010-12-30|   -|   -|  -|38.10|     0|
> |2010-12-23|   -|   -|  -|38.36|     0|
> |2008-04-30|   -|   -|  -|32.39|     0|
> |2008-04-29|   -|   -|  -|33.05|     0|
> |2008-04-28|   -|   -|  -|32.60|     0|
> +----------+----+----+---+-----+------+
> Now there are ways of dealing with this. However, the solution has to be
> generic! Checking for a column == "-" is not generic. How about if that
> column was "," etc.
>
> This is an issue in most databases. Specifically if a field is NaN.. --> (
> *NaN*, standing for not a number, is a numeric data type value
> representing an undefined or unrepresentable value, especially in
> floating-point calculations)
>
> Spark handles this
> <https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrameNaFunctions.html>.
> I am on  Spark 2.0.1  in Class DataFrameNaFunctions. The simplest one is to
> drop these rogue rows
>
> df2.filter( $"Open" === "-").drop()
>
> However, a better approach would be to use REPLACE method or testing any
> column for NaN
>
>
>
>
> There is a method called isnan(). However, it does not return correct
> values!
>
>  df2.filter(isnan($"Open")).show
> +-----+------+---------+----+----+---+-----+------+
> |Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
> +-----+------+---------+----+----+---+-----+------+
> +-----+------+---------+----+----+---+-----+------+
>
>
> Any suggestions?
>
> Thanks
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On 28 September 2016 at 04:07, Mike Metzger <mi...@flexiblecreations.com>
> wrote:
>
> Hi Mich -
>
>    Can you run a filter command on df1 prior to your map for any rows
> where p(3).toString != '-' then run your map command?
>
> Thanks
>
>
> Mike
>
>
> On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
> Thanks guys
>
> Actually these are the 7 rogue rows. The column 0 is the Volume column
> which means there was no trades on those days
>
>
> cat stock.csv|grep ",0"
> SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>
> So one way would be to exclude the rows that there was no volume of trade
> that day when cleaning up the csv file
>
> *cat stock.csv|grep -v ",0"*
>
> and that works. Bearing in mind that putting 0s in place of "-" will skew
> the price plot.
>
> BTW I am using Spark csv as well
>
> val df1 = spark.read.option("header", true).csv(location)
>
> This is the class and the mapping
>
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>
>
>
> In here I have
>
> p(3).toString.toFloat
>
>
> How can one check for rogue data in p(3)?
>
>
> Thanks
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On 27 September 2016 at 21:49, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>
> I have historical prices for various stocks.
>
> Each csv file has 10 years trade one row per each day.
>
> These are the columns defined in the class
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>
> The issue is with Open, High, Low, Close columns that all are defined as
> Float.
>
> Most rows are OK like below but the red one with "-" defined as Float
> causes issues
>
>   Date     Open High  Low   Close Volume
> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
> 23-Dec-11   -     -    -    40.56 0
>
> Because the prices are defined as Float, these rows cause the application
> to crash
> scala> val rs = df2.filter(changeToDate("TradeDate") >=
> monthsago).select((changeToDate("TradeDate").as("TradeDate")
> ),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's Low"),
> 'High.as("Day's High")).orderBy("TradeDate").collect
> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID
> 260)
> java.lang.NumberFormatException: For input string: "-"
>
>
>
>
> One way is to define the prices as Strings but that is not
> meaningful. Alternatively do the clean up before putting csv in HDFS but
> that becomes tedious and error prone.
>
> Any ideas will be appreciated.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>

Re: Issue with rogue data in csv file used in Spark application

Posted by Bedrytski Aliaksandr <sp...@bedryt.ski>.
Hi Mich,

if I understood you well, you may cast the value to float; it will yield
null if the value is not a correct float:

val df = Seq(("-", 5), ("1", 6), (",", 7), ("8.6", 7)).toDF("value",
"id").createOrReplaceTempView("lines")

spark.sql("SELECT cast(value as FLOAT) from lines").show()

+-----+
|value|
+-----+
| null|
|  1.0|
| null|
|  8.6|
+-----+

After it you may filter the DataFrame for values containing null.
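
For example, a short sketch of that filtering step, reusing the temporary
view created above (pure SQL, so nothing beyond the cast is assumed):

// Keep only the rows whose value survives the cast to FLOAT
spark.sql("SELECT cast(value AS FLOAT) AS value FROM lines " +
          "WHERE cast(value AS FLOAT) IS NOT NULL").show()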

Regards,
--
  Bedrytski Aliaksandr
  spark@bedryt.ski



On Wed, Sep 28, 2016, at 10:11, Mich Talebzadeh wrote:
> Thanks all.
>
> This is the csv schema all columns mapped to String
>
> scala> df2.printSchema
> root
>  |-- Stock: string (nullable = true)
>  |-- Ticker: string (nullable = true)
>  |-- TradeDate: string (nullable = true)
>  |-- Open: string (nullable = true)
>  |-- High: string (nullable = true)
>  |-- Low: string (nullable = true)
>  |-- Close: string (nullable = true)
>  |-- Volume: string (nullable = true)
>
> The issue I have can be shown as below
>
> df2.filter( $"OPen" === "-
> ").select((changeToDate("TradeDate").as("TradeDate")), 'Open, 'High,
> 'Low, 'Close, 'Volume).show
>
> +----------+----+----+---+-----+------+
> | TradeDate|Open|High|Low|Close|Volume|
> +----------+----+----+---+-----+------+
> |2011-12-23|   -|   -|  -|40.56|     0|
> |2011-04-21|   -|   -|  -|45.85|     0|
> |2010-12-30|   -|   -|  -|38.10|     0|
> |2010-12-23|   -|   -|  -|38.36|     0|
> |2008-04-30|   -|   -|  -|32.39|     0|
> |2008-04-29|   -|   -|  -|33.05|     0|
> |2008-04-28|   -|   -|  -|32.60|     0|
> +----------+----+----+---+-----+------+
> Now there are ways of dealing with this. However, the solution has to
> be generic! Checking for a column == "-" is not generic. How about if
> that column was "," etc.
>
> This is an issue in most databases. Specifically if a field is NaN..
> --> (*NaN*, standing for not a number, is a numeric data type value
> representing an undefined or unrepresentable value, especially in floating-
> point calculations)
>
> Spark handles this[1]. I am on  Spark 2.0.1  in Class
> DataFrameNaFunctions. The simplest one is to drop these rogue rows
> df2.filter( $"Open" === "-").drop()
> However, a better approach would be to use REPLACE method or testing
> any column for NaN
>
>
>
>
> There is a method called isnan(). However, it does not return
> correct values!
>
>  df2.filter(isnan($"Open")).show
> +-----+------+---------+----+----+---+-----+------+
> |Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
> +-----+------+---------+----+----+---+-----+------+
> +-----+------+---------+----+----+---+-----+------+
>
>
> Any suggestions?
>
> Thanks
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which
> may arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary
> damages arising from such loss, damage or destruction.
>
>
>
>
> On 28 September 2016 at 04:07, Mike Metzger
> <mi...@flexiblecreations.com> wrote:
>> Hi Mich -
>>
>>    Can you run a filter command on df1 prior to your map for any rows
>>    where p(3).toString != '-' then run your map command?
>>
>> Thanks
>>
>>
>> Mike
>>
>>
>> On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh
>> <mi...@gmail.com> wrote:
>>> Thanks guys
>>>
>>> Actually these are the 7 rogue rows. The column 0 is the Volume
>>> column  which means there was no trades on those days
>>>
>>> cat stock.csv|grep ",0"
>>> SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
>>> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
>>> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
>>> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
>>> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
>>> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
>>> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>>>
>>> So one way would be to exclude the rows that there was no volume of
>>> trade that day when cleaning up the csv file
>>>
>>> *cat stock.csv|grep -v ",0"*
>>>
>>> and that works. Bearing in mind that putting 0s in place of "-" will
>>> skew the price plot.
>>>
>>> BTW I am using Spark csv as well
>>>
>>> val df1 = spark.read.option("header", true).csv(location)
>>>
>>> This is the class and the mapping
>>>
>>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
>>> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
>>> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>>>
>>>
>>>
>>> In here I have
>>>
>>> p(3).toString.toFloat
>>>
>>>
>>> How can one check for rogue data in p(3)?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn *
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>> for any loss, damage or destruction of data or any other property
>>> which may arise from relying on this email's technical content is
>>> explicitly disclaimed. The author will in no case be liable for any
>>> monetary damages arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On 27 September 2016 at 21:49, Mich Talebzadeh
>>> <mi...@gmail.com> wrote:
>>>>
>>>> I have historical prices for various stocks.
>>>>
>>>> Each csv file has 10 years trade one row per each day.
>>>>
>>>> These are the columns defined in the class
>>>>
>>>> case class columns(Stock: String, Ticker: String, TradeDate:
>>>> String, Open: Float, High: Float, Low: Float, Close: Float, Volume:
>>>> Integer)
>>>>
>>>> The issue is with Open, High, Low, Close columns that all are
>>>> defined as Float.
>>>>
>>>> Most rows are OK like below but the red one with "-" defined as
>>>> Float causes issues
>>>>
>>>>   Date     Open High  Low   Close Volume
>>>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>>>> 23-Dec-11   -     -    -    40.56 0
>>>>
>>>> Because the prices are defined as Float, these rows cause the
>>>> application to crash
>>>> scala> val rs = df2.filter(changeToDate("TradeDate") >=
>>>> monthsago).select((changeToDate("TradeDate").as("TradeDate")),(('Close+'Open)/2).as("AverageDailyPrice"),
>>>> 'Low.as("Day's Low"), 'High.as("Day's High")).orderBy("TradeDate").collect
>>>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID 260)
>>>> java.lang.NumberFormatException: For input string: "-"
>>>>
>>>>
>>>>
>>>>
>>>> One way is to define the prices as Strings but that is not
>>>> meaningful. Alternatively do the clean up before putting csv in
>>>> HDFS but that becomes tedious and error prone.
>>>>
>>>> Any ideas will be appreciated.
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn *
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>> for any loss, damage or destruction of data or any other property
>>>> which may arise from relying on this email's technical content is
>>>> explicitly disclaimed. The author will in no case be liable for any
>>>> monetary damages arising from such loss, damage or destruction.
>>>>
>>>>
>>>>


Links:

  1. https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrameNaFunctions.html

Re: Issue with rogue data in csv file used in Spark application

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks all.

This is the csv schema all columns mapped to String

scala> df2.printSchema
root
 |-- Stock: string (nullable = true)
 |-- Ticker: string (nullable = true)
 |-- TradeDate: string (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
 |-- Volume: string (nullable = true)

The issue I have can be shown as below

df2.filter( $"OPen" ===
"-").select((changeToDate("TradeDate").as("TradeDate")),
'Open, 'High, 'Low, 'Close, 'Volume).show

+----------+----+----+---+-----+------+
| TradeDate|Open|High|Low|Close|Volume|
+----------+----+----+---+-----+------+
|2011-12-23|   -|   -|  -|40.56|     0|
|2011-04-21|   -|   -|  -|45.85|     0|
|2010-12-30|   -|   -|  -|38.10|     0|
|2010-12-23|   -|   -|  -|38.36|     0|
|2008-04-30|   -|   -|  -|32.39|     0|
|2008-04-29|   -|   -|  -|33.05|     0|
|2008-04-28|   -|   -|  -|32.60|     0|
+----------+----+----+---+-----+------+

Now there are ways of dealing with this. However, the solution has to be
generic! Checking for a column == "-" is not generic. How about if that
column contained "," etc.?

This is an issue in most databases, specifically if a field is NaN (*NaN*,
standing for Not a Number, is a numeric data type value representing an
undefined or unrepresentable value, especially in floating-point
calculations).

Spark handles this
<https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrameNaFunctions.html>.
I am on  Spark 2.0.1  in Class DataFrameNaFunctions. The simplest one is to
drop these rogue rows

df2.filter( $"Open" === "-").drop()

However, a better approach would be to use REPLACE method or testing any
column for NaN



There is a method called isnan(). However, it does not return the rows I
expect here!

 df2.filter(isnan($"Open")).show
+-----+------+---------+----+----+---+-----+------+
|Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
+-----+------+---------+----+----+---+-----+------+
+-----+------+---------+----+----+---+-----+------+

Any suggestions?

Thanks
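
A possible explanation, and a sketch of a more generic route building on the
cast suggestion made elsewhere in this thread (the exact column list is an
assumption): isnan() only matches genuine floating-point NaN values, so it
finds nothing in a string column that merely holds "-". Casting the price
columns to Float turns any unparseable value into null, after which
DataFrameNaFunctions can drop those rows via df2.na:

import org.apache.spark.sql.functions.col

// Cast every price column to Float; "-", "," and similar values become null
val priceCols = Seq("Open", "High", "Low", "Close")
val df3 = priceCols.foldLeft(df2.toDF()) { (df, c) =>
  df.withColumn(c, col(c).cast("Float"))
}

// Drop the rows where any of the price columns is null
val cleaned = df3.na.drop(priceCols)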



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 September 2016 at 04:07, Mike Metzger <mi...@flexiblecreations.com>
wrote:

> Hi Mich -
>
>    Can you run a filter command on df1 prior to your map for any rows
> where p(3).toString != '-' then run your map command?
>
> Thanks
>
> Mike
>
> On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Thanks guys
>>
>> Actually these are the 7 rogue rows. The column 0 is the Volume column
>> which means there was no trades on those days
>>
>>
>> cat stock.csv|grep ",0"
>> SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
>> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
>> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
>> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
>> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
>> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
>> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>>
>> So one way would be to exclude the rows that there was no volume of trade
>> that day when cleaning up the csv file
>>
>> *cat stock.csv|grep -v **",0"*
>>
>> and that works. Bearing in mind that putting 0s in place of "-" will skew
>> the price plot.
>>
>> BTW I am using Spark csv as well
>>
>> val df1 = spark.read.option("header", true).csv(location)
>>
>> This is the class and the mapping
>>
>>
>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
>> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
>> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>>
>>
>> In here I have
>>
>> p(3).toString.toFloat
>>
>> How can one check for rogue data in p(3)?
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 September 2016 at 21:49, Mich Talebzadeh <mich.talebzadeh@gmail.com
>> > wrote:
>>
>>>
>>> I have historical prices for various stocks.
>>>
>>> Each csv file has 10 years trade one row per each day.
>>>
>>> These are the columns defined in the class
>>>
>>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>>
>>> The issue is with Open, High, Low, Close columns that all are defined as
>>> Float.
>>>
>>> Most rows are OK like below but the red one with "-" defined as Float
>>> causes issues
>>>
>>>   Date     Open High  Low   Close Volume
>>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>>> 23-Dec-11   -     -    -    40.56 0
>>>
>>> Because the prices are defined as Float, these rows cause the
>>> application to crash
>>> scala> val rs = df2.filter(changeToDate("TradeDate") >=
>>> monthsago).select((changeToDate("TradeDate").as("TradeDate")
>>> ),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's Low"),
>>> 'High.as("Day's High")).orderBy("TradeDate").collect
>>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0
>>> (TID 260)
>>> java.lang.NumberFormatException: For input string: "-"
>>>
>>>
>>> One way is to define the prices as Strings but that is not
>>> meaningful. Alternatively do the clean up before putting csv in HDFS but
>>> that becomes tedious and error prone.
>>>
>>> Any ideas will be appreciated.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>

Re: Issue with rogue data in csv file used in Spark application

Posted by Mike Metzger <mi...@flexiblecreations.com>.
Hi Mich -

   Can you run a filter command on df1 prior to your map for any rows where
p(3).toString != '-' then run your map command?

Thanks

Mike
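
A rough sketch of what that filter-before-map could look like, assuming the
df1 and the columns case class defined earlier in the thread:

// Drop rows whose Open field (p(3)) is the "-" placeholder before mapping,
// so the toFloat conversions below never see it.
val df2 = df1.filter(p => p(3).toString != "-")
             .map(p => columns(p(0).toString, p(1).toString, p(2).toString,
                               p(3).toString.toFloat, p(4).toString.toFloat,
                               p(5).toString.toFloat, p(6).toString.toFloat,
                               p(7).toString.toInt))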

On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Thanks guys
>
> Actually these are the 7 rogue rows. The column 0 is the Volume column
> which means there was no trades on those days
>
>
> cat stock.csv|grep ",0"
> SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>
> So one way would be to exclude the rows that there was no volume of trade
> that day when cleaning up the csv file
>
> *cat stock.csv|grep -v **",0"*
>
> and that works. Bearing in mind that putting 0s in place of "-" will skew
> the price plot.
>
> BTW I am using Spark csv as well
>
> val df1 = spark.read.option("header", true).csv(location)
>
> This is the class and the mapping
>
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>
>
> In here I have
>
> p(3).toString.toFloat
>
> How can one check for rogue data in p(3)?
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 September 2016 at 21:49, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>>
>> I have historical prices for various stocks.
>>
>> Each csv file has 10 years trade one row per each day.
>>
>> These are the columns defined in the class
>>
>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>
>> The issue is with Open, High, Low, Close columns that all are defined as
>> Float.
>>
>> Most rows are OK like below but the red one with "-" defined as Float
>> causes issues
>>
>>   Date     Open High  Low   Close Volume
>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>> 23-Dec-11   -     -    -    40.56 0
>>
>> Because the prices are defined as Float, these rows cause the application
>> to crash
>> scala> val rs = df2.filter(changeToDate("TradeDate") >=
>> monthsago).select((changeToDate("TradeDate").as("TradeDate")
>> ),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's Low"),
>> 'High.as("Day's High")).orderBy("TradeDate").collect
>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0
>> (TID 260)
>> java.lang.NumberFormatException: For input string: "-"
>>
>>
>> One way is to define the prices as Strings but that is not
>> meaningful. Alternatively do the clean up before putting csv in HDFS but
>> that becomes tedious and error prone.
>>
>> Any ideas will be appreciated.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>

Re: Issue with rogue data in csv file used in Spark application

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks guys

Actually these are the 7 rogue rows. The column with 0 is the Volume column,
which means there were no trades on those days


cat stock.csv|grep ",0"
SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0

So one way would be to exclude the rows where there was no trade volume
that day when cleaning up the csv file

cat stock.csv|grep -v ",0"

and that works. Bearing in mind that putting 0s in place of "-" will skew
the price plot.

BTW I am using Spark csv as well

val df1 = spark.read.option("header", true).csv(location)

This is the class and the mapping


case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
Float, High: Float, Low: Float, Close: Float, Volume: Integer)
val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString,
p(3).toString.toFloat, p(4).toString.toFloat, p(5).toString.toFloat,
p(6).toString.toFloat, p(7).toString.toInt))


In here I have

p(3).toString.toFloat

How can one check for rogue data in p(3)?


Thanks
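
One way to guard the conversion itself, sketched here as a possibility rather
than something from the thread, is to wrap toFloat so that a rogue value such
as "-" simply fails the parse instead of throwing:

import scala.util.Try

// A safe conversion: rogue values such as "-" or "," yield None
def toFloatOpt(s: String): Option[Float] = Try(s.trim.toFloat).toOption

// Keep only the rows whose Open field (p(3)) actually parses as a Float
val good = df1.filter(p => toFloatOpt(p(3).toString).isDefined)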





Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 September 2016 at 21:49, Mich Talebzadeh <mi...@gmail.com>
wrote:

>
> I have historical prices for various stocks.
>
> Each csv file has 10 years trade one row per each day.
>
> These are the columns defined in the class
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>
> The issue is with Open, High, Low, Close columns that all are defined as
> Float.
>
> Most rows are OK like below but the red one with "-" defined as Float
> causes issues
>
>   Date     Open High  Low   Close Volume
> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
> 23-Dec-11   -     -    -    40.56 0
>
> Because the prices are defined as Float, these rows cause the application
> to crash
> scala> val rs = df2.filter(changeToDate("TradeDate") >=
> monthsago).select((changeToDate("TradeDate").as("
> TradeDate")),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's
> Low"), 'High.as("Day's High")).orderBy("TradeDate").collect
> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID
> 260)
> java.lang.NumberFormatException: For input string: "-"
>
>
> One way is to define the prices as Strings but that is not
> meaningful. Alternatively do the clean up before putting csv in HDFS but
> that becomes tedious and error prone.
>
> Any ideas will be appreciated.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>

Re: Issue with rogue data in csv file used in Spark application

Posted by Adrian Bridgett <ad...@opensignal.com>.
We use spark-csv (a successor of which is built in to Spark 2.0) for
this. It doesn't cause crashes; failed parsing is logged. We run on
Mesos, so I have to pull back all the logs from all the executors and
search for failed lines (so that we can ensure that the failure rate
isn't too high).

Hope this helps.

Adrian
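
For reference, a sketch of how that behaviour is selected with the Spark 2.0
built-in reader; the mode names below are the reader's standard options,
though whether DROPMALFORMED suits this particular data set is an assumption:

// PERMISSIVE (the default) keeps the row, DROPMALFORMED discards rows the
// parser cannot handle, FAILFAST throws on the first bad record.
val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(schema)            // an explicit schema, as discussed in this thread
  .csv(location)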


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Issue with rogue data in csv file used in Spark application

Posted by ayan guha <gu...@gmail.com>.
You can read everything as strings, write a map to fix the rows and then
convert back to your desired DataFrame.
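
A sketch of that approach, assuming the df1 and the columns case class from
elsewhere in the thread; the substitution value for "-" is an arbitrary
choice (as noted earlier, 0 would skew a price plot, so NaN is used here):

// Read every column as a string, repair rogue price fields in a map,
// then convert to the typed case class.
val fixed = df1.map { p =>
  def num(s: String): Float = if (s.trim == "-") Float.NaN else s.trim.toFloat
  columns(p(0).toString, p(1).toString, p(2).toString,
          num(p(3).toString), num(p(4).toString),
          num(p(5).toString), num(p(6).toString),
          p(7).toString.toInt)
}
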
On 28 Sep 2016 06:49, "Mich Talebzadeh" <mi...@gmail.com> wrote:

>
> I have historical prices for various stocks.
>
> Each csv file has 10 years trade one row per each day.
>
> These are the columns defined in the class
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>
> The issue is with Open, High, Low, Close columns that all are defined as
> Float.
>
> Most rows are OK like below but the red one with "-" defined as Float
> causes issues
>
>   Date     Open High  Low   Close Volume
> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
> 23-Dec-11   -     -    -    40.56 0
>
> Because the prices are defined as Float, these rows cause the application
> to crash
> scala> val rs = df2.filter(changeToDate("TradeDate") >=
> monthsago).select((changeToDate("TradeDate").as("
> TradeDate")),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's
> Low"), 'High.as("Day's High")).orderBy("TradeDate").collect
> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID
> 260)
> java.lang.NumberFormatException: For input string: "-"
>
>
> One way is to define the prices as Strings but that is not
> meaningful. Alternatively do the clean up before putting csv in HDFS but
> that becomes tedious and error prone.
>
> Any ideas will be appreciated.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>