Posted to user@spark.apache.org by Manoj Samel <ma...@gmail.com> on 2014/01/23 05:04:59 UTC

Handling occasional bad data ...

Hi,

How does Spark handle the following case?

Thousands of CSV files (each around 50 MB) come in from an external system.
A single RDD is defined over all of them, and some of the CSV fields are
parsed as BigDecimal etc. While building the RDD, the job errors out after
some time with a bad BigDecimal format (the error shows max retries 4).

1) A massive dataset like this is very likely to contain occasional bad
rows, and it is not feasible to fix the data set or pre-process it to
eliminate the bad data. How does Spark handle this? Is it possible to, say,
ignore the first N bad rows, etc.?

2) What is the "max 4 retries" in the error message? Is there any way to
control it?

Thanks,

Re: Handling occasional bad data ...

Posted by Manoj Samel <ma...@gmail.com>.
Thanks, Prashant



Re: Handling occasional bad data ...

Posted by Prashant Sharma <sc...@gmail.com>.
 spark.task.maxFailures
 http://spark.incubator.apache.org/docs/latest/configuration.html
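
In case it helps, here is a rough sketch of bumping that limit when creating
the context (the app name and the value 8 are just placeholders; on older
releases this property was set through Java system properties rather than
SparkConf):

import org.apache.spark.{SparkConf, SparkContext}

// Allow more task attempts before Spark aborts the stage (default is 4)
val conf = new SparkConf()
  .setAppName("csv-ingest")
  .set("spark.task.maxFailures", "8")
val sc = new SparkContext(conf)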


-- 
Prashant

Re: Handling occasional bad data ...

Posted by Andrew Ash <an...@andrewash.com>.
Why can't you preprocess to filter out the bad rows?  I often do this on
CSV files by testing if the raw line is "parseable" before splitting on ","
or similar.  Just validate the line before attempting to apply BigDecimal
or anything like that.
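
For what it's worth, a minimal sketch of that kind of validation (the split
on "," and the field index are placeholders for your actual schema, and sc is
assumed to be an existing SparkContext):

import scala.util.Try

// Drop any line whose numeric field does not parse as a BigDecimal,
// instead of letting the parse failure kill the task.
val rows = sc.textFile("hdfs:///data/csv/*").flatMap { line =>
  val fields = line.split(",")
  Try(BigDecimal(fields(2))).toOption.map(amount => (fields(0), amount))
}

Bad lines are simply dropped here; if you need visibility into how many were
skipped, you could also count them with an accumulator.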

Cheers,
Andrew

