Posted to user@spark.apache.org by Tim Chou <ti...@gmail.com> on 2014/11/13 23:59:46 UTC

Spark- How can I run MapReduce only on one partition in an RDD?

Hi All,

I use textFile to create an RDD. However, I don't want to process the whole
dataset in this RDD. For example, maybe I only want to work on the data in
the 3rd partition of the RDD.

How can I do it? Here are some possible solutions I'm considering:
1. Create multiple RDDs when reading the file.
2. Run map/reduce functions on only a specific partition of the RDD.

However, I cannot find any appropriate function.

Thank you; I look forward to your suggestions.

Best,
Tim

RE: Spark- How can I run MapReduce only on one partition in an RDD?

Posted by "Ganelin, Ilya" <Il...@capitalone.com>.
Why do you only want the third partition? You can inspect an RDD's partitions via the partitions() method, and you can use filter() to keep only the data you care about. Moreover, when you create your RDDs, unless you define a custom partitioner, you have no way of controlling which data ends up in partition #3. Therefore, there is almost no reason to want to operate on an individual partition.
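
For instance, a rough filter() sketch in Scala (the input path and the
predicate are placeholders, not from this thread):

    val lines = sc.textFile("hdfs:///path/to/input")   // placeholder path
    // keep only the records you actually care about, wherever they happen to live
    val interesting = lines.filter(line => line.contains("ERROR"))
    interesting.count()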

-----Original Message-----
From: Tim Chou [timchou.hit@gmail.com]
Sent: Thursday, November 13, 2014 06:01 PM Eastern Standard Time
To: user@spark.apache.org
Subject: Spark- How can I run MapReduce only on one partition in an RDD?

Hi All,

I use textFile to create an RDD. However, I don't want to process the whole dataset in this RDD. For example, maybe I only want to work on the data in the 3rd partition of the RDD.

How can I do it? Here are some possible solutions I'm considering:
1. Create multiple RDDs when reading the file.
2. Run map/reduce functions on only a specific partition of the RDD.

However, I cannot find any appropriate function.

Thank you; I look forward to your suggestions.

Best,
Tim

Re: Spark- How can I run MapReduce only on one partition in an RDD?

Posted by adrian <ad...@gmail.com>.
The direct answer you are looking for may be RDD.mapPartitionsWithIndex().
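
A minimal sketch of that approach in Scala (the input path is a placeholder;
partition indices are zero-based, so the "3rd partition" is index 2):

    val lines = sc.textFile("hdfs:///path/to/input")   // placeholder path
    val thirdPartitionOnly = lines.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 2) iter else Iterator.empty           // keep only partition index 2
    }
    thirdPartitionOnly.collect()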

The better question is: why are you looking at only the 3rd partition? To
analyze a random sample? Then look into RDD.sample(). Are you sure the data
you are looking for is in the 3rd partition? What if you end up with only 2
partitions after loading your data? Or perhaps you really want to filter()
your RDD?
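
For the random-sample case, a rough sketch (the fraction and seed are
arbitrary placeholders):

    val lines = sc.textFile("hdfs:///path/to/input")   // placeholder path
    val roughTenPercent = lines.sample(withReplacement = false, fraction = 0.1, seed = 42)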

Adrian


Tim Chou wrote
> Hi All,
> 
> I use textFile to create an RDD. However, I don't want to process the whole
> dataset in this RDD. For example, maybe I only want to work on the data in
> the 3rd partition of the RDD.
> 
> How can I do it? Here are some possible solutions I'm considering:
> 1. Create multiple RDDs when reading the file.
> 2. Run map/reduce functions on only a specific partition of the RDD.
> 
> However, I cannot find any appropriate function.
> 
> Thank you; I look forward to your suggestions.
> 
> Best,
> Tim






---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org