Posted to user@spark.apache.org by Konstantin Kudryavtsev <ku...@gmail.com> on 2014/07/09 11:45:25 UTC
Filtering data during the read
Hi all,
I was wondering if you could help me clarify the following situation:
in the classic example
val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR"))
As I understand it, the data is first read into memory, and only after that
is the filter applied. Is there any way to apply the filter during the read
step, so that not all objects are put into memory?
Thank you,
Konstantin Kudryavtsev
Re: Filtering data during the read
Posted by Mayur Rustagi <ma...@gmail.com>.
Hi,
Spark does that out of the box for you :)
Transformations like filter are lazy, and Spark pipelines them: each
partition is streamed through the filter record by record, so the whole
file is never materialized in memory at once.
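To make the pipelining concrete, here is a minimal non-Spark sketch using plain Scala Iterators, which have the same lazy, record-at-a-time behaviour as a pipelined RDD partition. The object and helper names are illustrative, not part of any Spark API:

```scala
object LazyFilterDemo {
  // Returns the first line containing "ERROR" (if any) together with how
  // many source lines were actually consumed to find it. Because map and
  // filter on an Iterator are lazy, lines are pulled one at a time and
  // never collected into memory as a whole.
  def firstErrorWithCount(lines: Iterator[String]): (Option[String], Int) = {
    var linesRead = 0
    val counted = lines.map { l => linesRead += 1; l }   // counts each pull
    val errors  = counted.filter(_.contains("ERROR"))    // nothing read yet
    val first   = if (errors.hasNext) Some(errors.next()) else None
    (first, linesRead)
  }

  def main(args: Array[String]): Unit = {
    val data = Iterator("INFO ok", "ERROR boom", "INFO ok")
    // Only 2 lines are consumed: reading stops at the first match.
    println(firstErrorWithCount(data))
  }
}
```

In Spark the same idea applies per partition: sc.textFile("hdfs://...") hands each task an iterator over its split, and filter wraps that iterator, so filtering effectively happens during the read.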
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>