Posted to user@spark.apache.org by Yanwei Zhang <ac...@hotmail.com> on 2016/11/02 16:28:46 UTC

Use a specific partition of dataframe

Is it possible to retrieve a specific partition (e.g., the first partition) of a DataFrame and apply some function to it? My data is too large, and I just want to get some approximate measures using the first few partitions of the data. I'll illustrate what I want to accomplish with the example below:

// create data
val tmp = sc.parallelize(Seq( ("a", 1), ("b", 2), ("a", 1),
                                  ("b", 2), ("a", 1), ("b", 2)), 2).toDF("var", "value")
// I want to get the first partition only, and do some calculation on it,
// for example, count by the value of "var"
val tmp1 = tmp.getPartition(0)   // hypothetical method -- does this exist in some form?
tmp1.groupBy("var").count()

The idea is to avoid going through all the data, to save computation time. So I am not sure whether mapPartitionsWithIndex is helpful in this case, since it still maps over all the data.

Regards,
Wayne



RE: Use a specific partition of dataframe

Posted by "Mendelson, Assaf" <As...@rsa.com>.
There are a couple of tools you can use. Take a look at the various DataFrame functions.
Specifically, limit might be useful for you, and the sample/sampleBy functions can make your data smaller.
Also, when creating the DataFrame in the first place you can sample the underlying data to begin with.
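To make this concrete, here is a rough sketch of all three options on the toy DataFrame from your question (the fractions and seed are arbitrary, just for illustration). This assumes a local spark-shell-style setup:

```scala
import org.apache.spark.sql.SparkSession

// Local session just for this sketch; in spark-shell, `spark` and `sc` already exist.
val spark = SparkSession.builder().master("local[*]").appName("sample-demo").getOrCreate()
import spark.implicits._

val tmp = spark.sparkContext.parallelize(
  Seq(("a", 1), ("b", 2), ("a", 1), ("b", 2), ("a", 1), ("b", 2)), 2
).toDF("var", "value")

// limit: take only the first n rows, then aggregate
tmp.limit(3).groupBy("var").count().show()

// sample: keep roughly 50% of rows, without replacement
tmp.sample(withReplacement = false, fraction = 0.5, seed = 42L)
   .groupBy("var").count().show()

// sampleBy: stratified sample, with a per-key fraction on column "var"
tmp.stat.sampleBy("var", Map("a" -> 0.5, "b" -> 0.5), seed = 42L)
   .groupBy("var").count().show()
```

Note that sample/sampleBy work on fractions, not row counts, so the size of the result is only approximate.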

Working by partition is possible by going through the RDD interface, but I am not sure this is what you want. Working on a specific partition might also mean seeing skewed data, because of the hashing method used to partition (this would of course depend on how your DataFrame was created).
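If you do want to go that route, a sketch via the RDD interface (assuming the `tmp` DataFrame and the `spark` session from earlier in the thread) could look like this. mapPartitionsWithIndex still schedules a task per partition, but the non-selected partitions just return empty iterators, and since the iterators are lazy, little actual work is done for them:

```scala
import org.apache.spark.sql.Row

// Keep only the rows of partition 0; other partitions emit nothing.
val firstPartition = tmp.rdd.mapPartitionsWithIndex(
  (idx, iter) => if (idx == 0) iter else Iterator.empty,
  preservesPartitioning = true
)

// Turn the surviving rows back into a DataFrame with the original schema.
val tmp1 = spark.createDataFrame(firstPartition, tmp.schema)
tmp1.groupBy("var").count().show()
```

Again, whatever rows land in partition 0 depends entirely on how the DataFrame was partitioned, so treat the result as a biased sample.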

Just to get smaller data, sample/sampleBy seems like the best solution to me.

Assaf.

From: Yanwei Zhang [mailto:actuary_zhang@hotmail.com]
Sent: Wednesday, November 02, 2016 6:29 PM
To: user
Subject: Use a specific partition of dataframe

[quoted message trimmed]