Posted to user@spark.apache.org by Michael Mansour <Mi...@symantec.com> on 2018/04/30 23:09:09 UTC

Re: [EXT] [Spark 2.x Core] .collect() size limit

Well, if you don't need to actually evaluate the results on the driver, but just need to trigger some sort of action, then you may want to consider the `foreach` or `foreachPartition` method. Both are actions and will execute your pipeline, and neither returns anything to the driver, so you won't blow out its memory. For instance, I coalesce results down to a smaller number of partitions and use `foreachPartition` to run the pipeline and write the results out via a custom DB connector.
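
A rough sketch of that pattern in Scala (MyDbClient and the partition count are made up here; swap in whatever connector and numbers fit your job):

    import org.apache.spark.rdd.RDD

    def writeOut(results: RDD[String]): Unit = {
      results
        .coalesce(64)                       // fewer, larger partitions => fewer DB connections
        .foreachPartition { partition =>
          val client = MyDbClient.connect() // one connection per partition, opened on the executor
          try partition.foreach(client.write)
          finally client.close()
        }
    }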

If you need to save the results of your job, then just write them to disk or a DB. 
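
That can be as simple as the following, reusing the results RDD from the sketch above (the path is only illustrative):

    results.saveAsTextFile("hdfs:///tmp/job-output")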

Please expand on what you're trying to achieve here. 

-- 
Michael Mansour
Data Scientist 
Symantec CASB

On 4/28/18, 8:41 AM, "klrmowse" <kl...@gmail.com> wrote:

    i am currently trying to find a workaround for the Spark application i am
    working on so that it does not have to use .collect()
    
    but, for now, it is going to have to use .collect()
    
    what is the size limit (in terms of driver memory) of an RDD that .collect()
    can work with?
    
    i've been scouring google - S.O., blogs, etc. - and everyone cautions against
    .collect(), but no one specifies how huge is huge... are we talking about a few
    gigabytes? terabytes?? petabytes???
    
    
    
    thank you
    
    
    
    --
    Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
    
    ---------------------------------------------------------------------
    To unsubscribe e-mail: user-unsubscribe@spark.apache.org
    
    


Re: [EXT] [Spark 2.x Core] .collect() size limit

Posted by klrmowse <kl...@gmail.com>.
okie, i may have found an alternative/workaround to using .collect() for what i
am trying to achieve...

initially, for the Spark application that i am working on, i was going to
.collect() two separate RDDs into a couple of ArrayLists (which is why i was
asking what the size limit on the driver is)

i need to map the 1st rdd to the 2nd rdd according to a computation/function
- resulting in key-value pairs;

it turns out, i don't need to call .collect() at all if i use .zipPartitions()
instead - i can just pass the function to it;
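
roughly, something like this (the rdd names, element types, and compute() are just placeholders for the real ones; the two rdds also need the same number of partitions):

    import org.apache.spark.rdd.RDD

    def compute(a: Int, b: String): String = s"$a:$b"   // stand-in for the real pairing function

    def pairUp(rdd1: RDD[Int], rdd2: RDD[String]): RDD[(Int, String)] =
      rdd1.zipPartitions(rdd2) { (left, right) =>
        // runs on the executors, partition by partition; nothing is pulled to the driver
        left.zip(right).map { case (a, b) => (a, compute(a, b)) }
      }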

i am currently testing it out...



thanks all for your responses



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org