Posted to dev@spark.apache.org by Haseeb <11...@seecs.edu.pk> on 2015/08/02 22:55:52 UTC

What are 'Buckets' referred in Spark Core code

Hi all,
I am a newbie trying to understand Spark internals. There is some entity
referred to as 'buckets' in many places in the Spark Core code, but I am
having a hard time working out what it is: it is only mentioned in code
comments, and I didn't come across any data structure or class that refers
to it. I'd be really grateful if someone could shed some light on what exactly
buckets are and what their function is with respect to Spark internals.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/What-are-Buckets-referred-in-Spark-Core-code-tp13557.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: What are 'Buckets' referred in Spark Core code

Posted by cheez <11...@seecs.edu.pk>.

Do we have a data structure that corresponds to buckets in the shuffle? That
is, if we wanted to explore the 'content' of these buckets in the shuffle
phase, can we do that? If yes, how?



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/What-are-Buckets-referred-in-Spark-Core-code-tp13557p13559.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: What are 'Buckets' referred in Spark Core code

Posted by Reynold Xin <rx...@databricks.com>.

There are two uses of buckets in Spark core.

The first use is in a histogram, which is used to perform sorting. Basically,
we build an approximate histogram of the data in order to decide how to
partition the data for sorting. Each bucket is a range in the histogram.
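
To make the histogram idea concrete, here is a minimal sketch in plain
Python. It is not Spark's code (Spark's own logic is written in Scala); the
function names range_bounds, bucket_for and bucketed_sort, the sample size,
and the fixed seed are all hypothetical choices made for illustration. It
samples the data, picks split points from the sample, and assigns each key to
the bucket whose range contains it.

```python
import bisect
import random


def range_bounds(sample, num_buckets):
    """Pick up to num_buckets - 1 split points from a sample of the keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_buckets
    return [ordered[int(step * i)] for i in range(1, num_buckets)]


def bucket_for(key, bounds):
    """Return the index of the bucket (range) that the key falls into."""
    return bisect.bisect_left(bounds, key)


def bucketed_sort(data, num_buckets, sample_size=100, seed=0):
    """Split data into ordered range buckets, then sort each bucket."""
    rng = random.Random(seed)  # fixed seed only to keep the sketch repeatable
    sample = rng.sample(data, min(sample_size, len(data)))
    bounds = range_bounds(sample, num_buckets)
    buckets = [[] for _ in range(num_buckets)]
    for key in data:
        buckets[bucket_for(key, bounds)].append(key)
    # Every key in bucket i is <= every key in bucket i + 1, so sorting each
    # bucket independently yields a globally sorted result when concatenated.
    return [sorted(bucket) for bucket in buckets]
```

Because the split points come from a sample, the buckets are only
approximately equal in size, which is what "approximate histogram" suggests
here.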

The second is in shuffle, where we partition the output of each map task
into different "buckets", letting the reduce side fetch the map-side data
based on its partition id.
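
The shuffle case can be sketched the same way, again in plain Python rather
than Spark's actual code. The names partition_id, run_map_task and
fetch_for_reducer are hypothetical, and hash partitioning is assumed here
only as one common way to map a key to a partition id. Each map task writes
one bucket per reduce partition, and a reducer collects its own bucket from
every map task.

```python
def partition_id(key, num_reducers):
    """Map a key to a reduce partition (assumed: simple hash partitioning)."""
    return hash(key) % num_reducers


def run_map_task(records, num_reducers):
    """Split one map task's (key, value) output into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[partition_id(key, num_reducers)].append((key, value))
    return buckets


def fetch_for_reducer(map_outputs, reduce_id):
    """Gather the bucket for reduce_id from every map task's output."""
    fetched = []
    for buckets in map_outputs:
        fetched.extend(buckets[reduce_id])
    return fetched
```

In this sketch, inspecting the "content" of a bucket just means looking at
one entry of the list returned by run_map_task for a given map task.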
