Posted to user@spark.apache.org by Gerard Maas <ge...@gmail.com> on 2014/11/14 16:47:50 UTC

Re: Skipping Bad Records in Spark

You can combine map and filter in one operation using
collect(PartialFunction) [1]:

val cleanData = rawData.collect { case x if condition(x) => f(x) }

[1] **Not to be confused with the parameterless rdd.collect(), which
triggers computation and delivers the results to the driver!**
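
For a runnable illustration, here is a self-contained sketch (sc is an
existing SparkContext; the sample data and the toInt parsing are invented
for the example):

import scala.util.Try

// Invented sample data; in practice rawData might come from sc.textFile(...).
val rawData = sc.parallelize(Seq("1", "2", "oops", "4"))

// collect(PartialFunction) keeps only the records the partial function
// is defined for, so the bad record "oops" is skipped in the same pass
// that converts the good ones.
val cleanData = rawData.collect {
  case s if Try(s.toInt).isSuccess => s.toInt
}

// The parameterless collect() from footnote [1]: triggers computation
// and brings the results back to the driver.
cleanData.collect().foreach(println)   // prints 1, 2, 4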

PS: use the user@spark.apache.org list for this kind of API usage
discussion; dev is mainly for discussing Spark internals.

On Fri, Nov 14, 2014 at 4:38 PM, Ganelin, Ilya <Il...@capitalone.com>
wrote:

> Hi Qiuzhuang - you have two options:
> 1) Within the map step, define a validation function that will be executed
> on every record.
> 2) Use the filter function to create a filtered dataset prior to
> processing. (Both options are sketched below.)
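>
> For illustration, a minimal sketch of both options (rawData is assumed
> to be an RDD[String]; validate and parse are hypothetical helpers):
>
> // Hypothetical validity check and parser for this sketch.
> def validate(line: String): Boolean =
>   line.trim.nonEmpty && line.trim.forall(_.isDigit)
> def parse(line: String): Int = line.trim.toInt
>
> // Option 1: validate inside the map step, tagging bad records as None
> // so they can be dropped afterwards.
> val tagged = rawData.map(line => if (validate(line)) Some(parse(line)) else None)
> val good   = tagged.flatMap(_.toSeq)   // drop the Nones
>
> // Option 2: filter out bad records first, then process the rest.
> val processed = rawData.filter(validate).map(parse)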
>
> On 11/14/14, 10:28 AM, "Qiuzhuang Lian" <qi...@gmail.com> wrote:
>
> >Hi,
> >
> >MapReduce has the feature of skipping bad records. Is there any equivalent
> >in Spark? Should I use the filter API to do this?
> >
> >Thanks,
> >Qiuzhuang
>