Posted to user@spark.apache.org by Konstantin Kudryavtsev <ku...@gmail.com> on 2014/07/09 11:45:25 UTC

Filtering data during the read

Hi all,

I wondered if you could help me clarify the following situation:
in the classic example

val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR"))

As I understand it, the data is first read into memory, and the filter is
applied after that. Is there any way to apply the filter during the read
step, so that not all objects are put into memory?

Thank you,
Konstantin Kudryavtsev

Re: Filtering data during the read

Posted by Mayur Rustagi <ma...@gmail.com>.
Hi,
Spark does that out of the box for you :)
Transformations like filter() are lazy, so Spark pipelines the read and
the filter: each line is filtered as it is read from HDFS, and the whole
file is never materialized in memory unless you explicitly cache it.
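
To make the laziness concrete, here is a minimal sketch against the same
Scala API used above (assuming "spark" is the SparkContext from your
example, and the HDFS path is a placeholder):

// Transformations are lazy: these two lines only record the lineage.
val file = spark.textFile("hdfs://...")                   // no I/O happens here
val errors = file.filter(line => line.contains("ERROR"))  // still no I/O

// An action triggers execution. The read and the filter run pipelined,
// one line at a time per partition, so non-matching lines are discarded
// as they are read and the full file is never held in memory.
val count = errors.count()

// Memory is only used for the dataset if you ask for it explicitly, and
// even then only the already-filtered lines are stored:
errors.cache()

You can inspect the lineage with errors.toDebugString, which prints the
chain of RDDs that Spark will execute together.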
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


