Posted to dev@hive.apache.org by "Xuefu Zhang (JIRA)" <ji...@apache.org> on 2016/12/31 06:02:58 UTC
[jira] [Created] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
Xuefu Zhang created HIVE-15527:
----------------------------------
Summary: Memory usage is unbound in SortByShuffler for Spark
Key: HIVE-15527
URL: https://issues.apache.org/jira/browse/HIVE-15527
Project: Hive
Issue Type: Improvement
Components: Spark
Affects Versions: 1.1.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
In SortByShuffler.java, an ArrayList backs the iterator over the values that share a key in the shuffled result produced by the Spark transformation sortByKey. Because every value of a key group is buffered in that list before the group is returned, a single large key group can exhaust memory.
{code}
@Override
public Tuple2<HiveKey, Iterable<BytesWritable>> next() {
  // TODO: implement this by accumulating rows with the same key into a list.
  // Note that this list needs to be improved to prevent excessive memory usage, but this
  // can be done in later phase.
  while (it.hasNext()) {
    Tuple2<HiveKey, BytesWritable> pair = it.next();
    if (curKey != null && !curKey.equals(pair._1())) {
      HiveKey key = curKey;
      List<BytesWritable> values = curValues;
      curKey = pair._1();
      curValues = new ArrayList<BytesWritable>();
      curValues.add(pair._2());
      return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, values);
    }
    curKey = pair._1();
    curValues.add(pair._2());
  }
  if (curKey == null) {
    throw new NoSuchElementException();
  }
  // if we get here, this should be the last element we have
  HiveKey key = curKey;
  curKey = null;
  return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, curValues);
}
{code}
Since the output of sortByKey is already sorted by key, the value iterable can instead be backed directly by the input iterator, so the values of a key group never need to be held in memory all at once.
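One way to picture this proposal is the sketch below. It is not the patch for this issue. It only illustrates the idea, using plain java.util types in place of Tuple2, HiveKey and BytesWritable. The class name LazyGroupingIterator and all of its members are hypothetical, and it assumes non-null keys and an input that is already sorted (or at least clustered) by key.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch: groups consecutive equal-key pairs from a key-sorted
 * iterator without copying the values of a group into a list.
 */
class LazyGroupingIterator<K, V> implements Iterator<Map.Entry<K, Iterable<V>>> {
  private final Iterator<Map.Entry<K, V>> input; // assumed sorted by key, keys non-null
  private Map.Entry<K, V> pending;               // one-pair lookahead; null once input is drained
  private K groupKey;                            // key of the group most recently handed out

  LazyGroupingIterator(Iterator<Map.Entry<K, V>> input) {
    this.input = input;
    advance();
  }

  private void advance() {
    pending = input.hasNext() ? input.next() : null;
  }

  @Override
  public boolean hasNext() {
    // Discard values the caller left unread in the previous group.
    while (pending != null && groupKey != null && groupKey.equals(pending.getKey())) {
      advance();
    }
    return pending != null;
  }

  @Override
  public Map.Entry<K, Iterable<V>> next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    final K key = pending.getKey();
    groupKey = key;
    // The values view pulls straight from the shared input; nothing is buffered.
    final Iterator<V> values = new Iterator<V>() {
      @Override
      public boolean hasNext() {
        return pending != null && key.equals(pending.getKey());
      }

      @Override
      public V next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        V value = pending.getValue();
        advance();
        return value;
      }
    };
    // Single-pass iterable: every call to iterator() returns the same cursor.
    Iterable<V> view = () -> values;
    return new SimpleImmutableEntry<>(key, view);
  }
}
```

The trade-off is that each value iterable can be traversed only once and is only meaningful until the outer iterator moves on to the next key. The issue text does not say whether the downstream consumers of SortByShuffler can live with that restriction, so it would need checking. Memory use is bounded by the single lookahead pair, regardless of group size.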