Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2016/05/21 20:48:08 UTC

Does DataFrame has something like set hive.groupby.skewindata=true;

Hi, I have a DataFrame with heavily skewed data (on the order of TBs) and I am
doing a groupBy on 8 fields, which unfortunately I can't avoid. While looking
for ways to optimize this, I found that Hive has

set hive.groupby.skewindata=true;

I don't use Hive; I have a Spark DataFrame. Can we achieve the above in Spark?
Please guide. Thanks in advance.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-DataFrame-has-something-like-set-hive-groupby-skewindata-true-tp26995.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Does DataFrame has something like set hive.groupby.skewindata=true;

Posted by Virgil Palanciuc <vi...@palanciuc.eu>.
It doesn't.
However, if you have a very large number of keys, with a small number of
very large keys, you can do one of the following:
A. Use a custom partitioner that counts the number of items per key and
avoids putting large keys together; alternatively, if feasible (and
needed), include part of the value in the key so that you can split very
large keys across multiple partitions (though that will likely alter
the way you need to do your computations).
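
To make option A concrete, here is an illustrative sketch in plain Python
(not the actual Spark Partitioner API): given a set of known-heavy keys, it
assigns each of them a dedicated partition and hash-distributes everything
else. The names, threshold of "heavy", and partition counts are hypothetical.

```python
def make_skew_partitioner(heavy_keys, num_partitions):
    """Return a key -> partition-index function.

    Each known-heavy key gets its own dedicated partition, numbered
    after the regular range; all other keys are hash-distributed
    over the first num_partitions partitions.
    """
    # Sort for a deterministic heavy-key -> partition assignment.
    heavy = {k: num_partitions + i for i, k in enumerate(sorted(heavy_keys))}

    def partition(key):
        if key in heavy:
            return heavy[key]          # dedicated partition for a large key
        return hash(key) % num_partitions  # regular keys share the rest
    return partition

# Usage: two hot keys get partitions 8 and 9; other keys land in 0..7.
part = make_skew_partitioner({"hot1", "hot2"}, num_partitions=8)
```

In Spark you would express the same logic as a subclass of
org.apache.spark.Partitioner and pass it to partitionBy() on a pair RDD.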
B. Count by key, and collect to the driver the keys with many values. Then
broadcast the set of "large keys", and:
  i. filter out the large keys and do your regular processing for the many
small keys;
  ii. filter out the small keys and do special processing for the very large
keys, exploiting the fact that you can probably store all of them in memory at
any given point (e.g. you can key by value and do random access to
retrieve the "key data" for any given key inside the mapPartitions() code).
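
The steps of option B can be sketched in plain Python (again, not the Spark
API; the records, threshold, and variable names below are hypothetical):

```python
from collections import Counter

# Hypothetical skewed dataset: key "b" is much larger than the others.
records = [("a", 1)] * 5 + [("b", 1)] * 100 + [("c", 1)] * 3
THRESHOLD = 10  # hypothetical cutoff for "large" keys

# Step 1: count by key (in Spark, a countByKey / groupBy().count()).
counts = Counter(k for k, _ in records)

# Step 2: collect the large keys; in Spark you would broadcast this set.
large_keys = {k for k, n in counts.items() if n >= THRESHOLD}

# Step 3 (B.i): regular processing for the many small keys.
small = [(k, v) for k, v in records if k not in large_keys]

# Step 4 (B.ii): special processing for the few very large keys.
large = [(k, v) for k, v in records if k in large_keys]
```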

There is no universal solution to this problem, but these are good general
approaches that should hopefully set you on the right track.
(Note: solution A works for RDDs only; solution B works with DataFrames
too.)
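
For reference, what hive.groupby.skewindata effectively does is a two-stage
aggregation: the first stage aggregates on a randomized ("salted") key so the
skewed key's work spreads over many reducers, and the second stage combines
the partial results on the real key. The same idea can be applied manually in
Spark; the sketch below shows the arithmetic in plain Python (the data and
salt count are hypothetical):

```python
import random
from collections import defaultdict

# Hypothetical skewed dataset: key "hot" dominates.
data = [("hot", 1)] * 1000 + [("cold", 1)] * 3
NUM_SALTS = 4

# Stage 1: aggregate on (key, salt), so "hot" is split across
# up to NUM_SALTS partial sums instead of one reducer.
partial = defaultdict(int)
for k, v in data:
    partial[(k, random.randrange(NUM_SALTS))] += v

# Stage 2: strip the salt and combine the partial sums per real key.
final = defaultdict(int)
for (k, _salt), s in partial.items():
    final[k] += s
```

In Spark you would implement stage 1 as a groupBy on the original columns
plus a random salt column, and stage 2 as a second groupBy on the original
columns alone, which works only for decomposable aggregates such as sums and
counts.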

Regards,
Virgil.


On Sat, May 21, 2016 at 11:48 PM, unk1102 <um...@gmail.com> wrote:

> Hi I am having DataFrame with huge skew data in terms of TB and I am doing
> groupby on 8 fields which I cant avoid unfortunately. I am looking to
> optimize this I have found hive has
>
> set hive.groupby.skewindata=true;
>
> I dont use Hive I have Spark DataFrame can we achieve above Spark? Please
> guide. Thanks in advance.