Posted to dev@mahout.apache.org by "Suneel Marthi (JIRA)" <ji...@apache.org> on 2015/05/31 06:39:17 UTC
[jira] [Updated] (MAHOUT-1700) OutOfMemory Problem in ABtDenseOutJob in Distributed SSVD
[ https://issues.apache.org/jira/browse/MAHOUT-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suneel Marthi updated MAHOUT-1700:
----------------------------------
    Fix Version/s: (was: 0.10.1)
                   0.11.0
> OutOfMemory Problem in ABtDenseOutJob in Distributed SSVD
> ---------------------------------------------------------
>
> Key: MAHOUT-1700
> URL: https://issues.apache.org/jira/browse/MAHOUT-1700
> Project: Mahout
> Issue Type: Bug
> Components: Math
> Affects Versions: 0.9, 0.10.0
> Reporter: Ethan Yi
> Labels: patch
> Fix For: 0.11.0
>
>
> Recently, I tried Mahout's Hadoop SSVD job (mahout-0.9 or mahout-1.0). There is a Java heap space OutOfMemory problem in ABtDenseOutJob. I found the reason: the ABtDenseOutJob map code is as below:
> protected void map(Writable key, VectorWritable value, Context context)
>   throws IOException, InterruptedException {
>   Vector vec = value.get();
>
>   int vecSize = vec.size();
>   if (aCols == null) {
>     // This allocation is where the OOM occurs when vecSize == Integer.MAX_VALUE (see below).
>     aCols = new Vector[vecSize];
>   } else if (aCols.length < vecSize) {
>     aCols = Arrays.copyOf(aCols, vecSize);
>   }
>
>   if (vec.isDense()) {
>     for (int i = 0; i < vecSize; i++) {
>       extendAColIfNeeded(i, aRowCount + 1);
>       aCols[i].setQuick(aRowCount, vec.getQuick(i));
>     }
>   } else if (vec.size() > 0) {
>     for (Vector.Element vecEl : vec.nonZeroes()) {
>       int i = vecEl.index();
>       extendAColIfNeeded(i, aRowCount + 1);
>       aCols[i].setQuick(aRowCount, vecEl.get());
>     }
>   }
>   aRowCount++;
> }
> If the input is a RandomAccessSparseVector, as is common with big data, its vec.size() is Integer.MAX_VALUE (2^31 - 1), so aCols = new Vector[vecSize] triggers the OutOfMemory problem. The obvious remedy would be to enlarge every TaskTracker's maximum heap:
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx1024m</value>
> </property>
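> To see why the allocation alone is fatal, here is a minimal standalone sketch (not part of the patch; it only assumes Mahout's org.apache.mahout.math API and the class name SparseSizeDemo is mine):
>
> import org.apache.mahout.math.RandomAccessSparseVector;
> import org.apache.mahout.math.Vector;
>
> public class SparseSizeDemo {
>   public static void main(String[] args) {
>     // A sparse vector declared over the full int range, holding one value.
>     Vector sparse = new RandomAccessSparseVector(Integer.MAX_VALUE);
>     sparse.setQuick(7, 1.0);
>     System.out.println(sparse.size());                      // 2147483647
>     System.out.println(sparse.getNumNondefaultElements());  // 1
>     // The mapper then effectively does:
>     //   Vector[] aCols = new Vector[sparse.size()];
>     // ~2^31 - 1 object references at 4-8 bytes each needs roughly
>     // 8-16 GB of heap before a single column vector is even created,
>     // hence the OutOfMemory error.
>   }
> }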
> However, if you are NOT the Hadoop administrator or ops, you have no permission to modify that config. So I tried to modify the ABtDenseOutJob map code to support the RandomAccessSparseVector situation: I use a HashMap to represent aCols instead of the original Vector[] aCols array. The modified code is as below:
> private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
>
> protected void map(Writable key, VectorWritable value, Context context)
>   throws IOException, InterruptedException {
>   Vector vec = value.get();
>   int vecSize = vec.size();
>
>   if (vec.isDense()) {
>     for (int i = 0; i < vecSize; i++) {
>       //extendAColIfNeeded(i, aRowCount + 1);
>       // Lazily create a sparse column only when column i is first touched.
>       if (aColsMap.get(i) == null) {
>         aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>       }
>       aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
>       //aCols[i].setQuick(aRowCount, vec.getQuick(i));
>     }
>   } else if (vec.size() > 0) {
>     for (Vector.Element vecEl : vec.nonZeroes()) {
>       int i = vecEl.index();
>       //extendAColIfNeeded(i, aRowCount + 1);
>       if (aColsMap.get(i) == null) {
>         aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>       }
>       aColsMap.get(i).setQuick(aRowCount, vecEl.get());
>       //aCols[i].setQuick(aRowCount, vecEl.get());
>     }
>   }
>   aRowCount++;
> }
> With this change, the OutOfMemory problem goes away.
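> One caveat with this approach (an assumption about the surrounding job code, not something verified against the rest of ABtDenseOutJob): a column that is never touched now has no map entry at all, whereas the array version held a slot for it. So wherever the job later iterates over the columns, the consumer must treat a missing key as an all-zero column, roughly like this (n and the processing step are hypothetical stand-ins):
>
> for (int i = 0; i < n; i++) {
>   Vector col = aColsMap.get(i);
>   if (col == null) {
>     continue; // column i received no values in this split: all zeros
>   }
>   // ... process col as the original code processed aCols[i] ...
> }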
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)