Posted to dev@spark.apache.org by Pedro Rodriguez <sk...@gmail.com> on 2016/06/18 04:32:49 UTC

Re: Skew data

I am going to take a guess that this means the partitions within an
RDD are unbalanced (one or more partitions are much larger than the
rest). A single core then has to do much more work than the others,
leading to poor performance. In general, the way to fix this is to
spread data evenly across partitions. In most cases calling repartition is
enough to solve the problem. If you have a special case, you might need to
create your own custom partitioner.
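To make the idea concrete, here is a small sketch in plain Python (no Spark
dependency) of why a hash partitioner piles one hot key onto a single
partition, and how a custom partitioning rule can spread it out. The key
names, partition count, and the salting rule are all illustrative
assumptions; in real Spark you would subclass org.apache.spark.Partitioner
(or call repartition) rather than use these functions.

```python
# Illustrative sketch only -- not the Spark API. Shows how one hot key
# skews a hash partitioner, and how a custom rule rebalances it.
import random
from collections import Counter

random.seed(0)          # deterministic for the example
NUM_PARTITIONS = 4

def hash_partitioner(key):
    # Rough analogue of Spark's HashPartitioner: same key -> same partition.
    return hash(key) % NUM_PARTITIONS

def custom_partitioner(key, hot_keys=frozenset({"user_42"})):
    # Hypothetical custom rule: scatter known hot keys across all
    # partitions ("salting"); everything else hashes as usual.
    if key in hot_keys:
        return random.randrange(NUM_PARTITIONS)
    return hash(key) % NUM_PARTITIONS

# 90% of records share one key -- classic skew.
records = ["user_42"] * 900 + [f"user_{i}" for i in range(100)]

plain  = Counter(hash_partitioner(k) for k in records)
custom = Counter(custom_partitioner(k) for k in records)

# Largest partition first: the plain hash puts 900+ records on one core,
# the custom rule spreads them roughly evenly.
print("plain :", sorted(plain.values(), reverse=True))
print("custom:", sorted(custom.values(), reverse=True))
```

With the plain hash, one partition holds at least the 900 hot-key records,
so one task runs far longer than the rest; the custom rule keeps every
partition near the same size. The trade-off is that a salted key must be
handled on the read side too (e.g. replicating the small side of a join),
which is why repartition is the simpler fix when it suffices.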

Pedro

On Thu, Jun 16, 2016 at 6:55 PM, Selvam Raman <se...@gmail.com> wrote:

> Hi,
>
> What is skew data?
>
> I read that if the data is skewed, a join can take a long time to finish
> (99 percent of tasks finish in seconds while the remaining 1 percent take
> minutes to hours).
>
> How do you handle skewed data in Spark?
>
> Thanks,
> Selvam R
> +91-97877-87724
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience