Posted to user@spark.apache.org by Mohit Singh <mo...@gmail.com> on 2014/03/04 01:13:32 UTC

filter operation in pyspark

Hi,
   I have a CSV file (say "n" columns).

I am trying to do a filter operation like:

query = rdd.filter(lambda x: x[1] == "1234")
query.take(20)
Basically, this should return the rows with that specific value, right?
This operation is taking quite some time to execute (possibly even slower
than the comparable Hadoop job).
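
For completeness, the RDD is built roughly like this (reading the file with
sc.textFile and splitting each line on commas; this is a sketch from memory,
so the details may differ slightly):

from pyspark import SparkContext

sc = SparkContext(appName="csv-filter")  # illustrative app name

# sc.textFile yields one raw string per line; splitting on commas is what
# makes x[1] in the filter refer to the second column rather than the
# second character of the line.
rdd = sc.textFile("hdfs://master:9000/user/hadoop/data/input.csv") \
        .map(lambda line: line.split(","))

Since take(20) is an action, each call scans partitions of the file until 20
matching rows are found.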

I am seeing this on my console:
14/03/03 16:13:03 INFO PythonRDD: Times: total = 5245, boot = 3, init = 8,
finish = 5234
14/03/03 16:13:03 INFO SparkContext: Job finished: take at <stdin>:1, took
5.249082169 s
14/03/03 16:13:03 INFO SparkContext: Starting job: take at <stdin>:1
14/03/03 16:13:03 INFO DAGScheduler: Got job 715 (take at <stdin>:1) with 1
output partitions (allowLocal=true)
14/03/03 16:13:03 INFO DAGScheduler: Final stage: Stage 720 (take at
<stdin>:1)
14/03/03 16:13:03 INFO DAGScheduler: Parents of final stage: List()
14/03/03 16:13:03 INFO DAGScheduler: Missing parents: List()
14/03/03 16:13:03 INFO DAGScheduler: Computing the requested partition
locally
14/03/03 16:13:03 INFO HadoopRDD: Input split:
hdfs://master:9000/user/hadoop/data/input.csv:5100273664+134217728

Am I not doing this correctly?


-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates

Re: filter operation in pyspark

Posted by Mayur Rustagi <ma...@gmail.com>.
This could be a number of issues. Maybe your CSV is not being split into
enough map tasks, or the file is not node-local to the worker processes.
How many tasks are you seeing in the Spark web UI for mapping and storing
the data? Are all the nodes being used when you look at the task level?
Is the time taken by each task roughly equal, or very skewed?
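A quick sketch (untested, and assuming a PySpark version that has
RDD.getNumPartitions; the repartition count is just an example) to check how
the work is being split and to avoid re-reading the file on every action:

# How many partitions (and therefore map tasks) does the input produce?
print(rdd.getNumPartitions())

# If there are too few partitions to keep all cores busy, spread the data out.
rdd = rdd.repartition(64)  # 64 is just an example value

# Cache the parsed rows so repeated filter/take calls do not re-read
# and re-split the whole file from HDFS.
rdd.cache()

query = rdd.filter(lambda x: x[1] == "1234")
print(query.take(20))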
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


