Posted to dev@mahout.apache.org by "Hui Wen Han (JIRA)" <ji...@apache.org> on 2010/08/12 16:08:21 UTC

[jira] Created: (MAHOUT-468) Performance of RowSimilarityJob is not good

Performance of RowSimilarityJob is not good
-------------------------------------------

                 Key: MAHOUT-468
                 URL: https://issues.apache.org/jira/browse/MAHOUT-468
             Project: Mahout
          Issue Type: Test
          Components: Collaborative Filtering
    Affects Versions: 0.4
            Reporter: Hui Wen Han
             Fix For: 0.4


I have run a test:

Preferences records: 680,194
distinct users: 23,246
distinct items: 437,569
SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE

maybePruneItemUserMatrixPath: 16.50M
weights: 13.80M
pairwiseSimilarity: 18.81G
Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: took 32 sec
Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: took 4.30 hours

I think the reasons may be the following:

1) We use SequenceFileOutputFormat, which means the job can only run with n mappers or reducers concurrently (n = number of Hadoop nodes).

2) We store redundant info. For example, the output of CooccurrencesMapper: (ItemIndexA,similarity),(ItemIndexA,ItemIndexB,similarity)

3) Some frequently used code:
https://issues.apache.org/jira/browse/MAHOUT-467

4) We allocate many local variables in a loop (needs confirmation).

In class DistributedUncenteredZeroAssumingCosineVectorSimilarity:

  @Override
  public double weight(Vector v) {
    double length = 0.0;
    // Iterate only over the non-zero elements of the sparse vector.
    Iterator<Element> elemIterator = v.iterateNonZero();
    while (elemIterator.hasNext()) {
      double value = elemIterator.next().get();  // this one: a local variable declared every iteration
      length += value * value;
    }
    return Math.sqrt(length);  // Euclidean (L2) norm
  }
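The weight() method above computes the Euclidean (L2) norm over a sparse vector's non-zero entries. A self-contained sketch of the same computation (SparseNormSketch and the Map-based sparse representation are illustrative stand-ins, not the actual Mahout Vector API):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of what weight() computes: the L2 norm over the
// non-zero elements of a sparse vector (here: index -> value map).
public class SparseNormSketch {

  static double weight(Map<Integer, Double> sparseVector) {
    double length = 0.0;
    for (double value : sparseVector.values()) {
      length += value * value;   // sum of squares, as in the loop above
    }
    return Math.sqrt(length);    // Euclidean (L2) norm
  }

  public static void main(String[] args) {
    Map<Integer, Double> v = new HashMap<>();
    v.put(3, 3.0);   // non-zero entry at index 3
    v.put(42, 4.0);  // non-zero entry at index 42
    System.out.println(weight(v));  // 5.0, the 3-4-5 triangle
  }
}
```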



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898067#action_12898067 ] 

Ted Dunning commented on MAHOUT-468:
------------------------------------

This screen shot shows a single mapper and single reducer running.  Is this really running on a hadoop cluster?  Or is it running in local mode?

Secondly, it shows no combiner action at all.  That seems implausible.





[jira] Resolved: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-468.
------------------------------

    Fix Version/s:     (was: 0.4)
       Resolution: Not A Problem

Since Sebastian is tracking #5 in MAHOUT-460, and I am not sure that the other items are actual performance issues, I'm going to close this.  If this can be re-phrased to concern a specific performance issue that has been reasonably confirmed, and ideally a patch, we could reopen (or just make a new JIRA).



[jira] Updated: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Hui Wen Han (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui Wen Han updated MAHOUT-468:
-------------------------------

    Description: 
I have run a test:

Preferences records: 680,194
distinct users: 23,246
distinct items: 437,569
SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE

maybePruneItemUserMatrixPath: 16.50M
weights: 13.80M
pairwiseSimilarity: 18.81G
Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: took 32 sec
Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: took 4.30 hours

I think the reasons may be the following:

1) We use SequenceFileOutputFormat, which means the job can only run with n mappers or reducers concurrently (n = number of Hadoop nodes).

2) We store redundant info. For example, the output of CooccurrencesMapper: (ItemIndexA,similarity),(ItemIndexA,ItemIndexB,similarity)

3) Some frequently used code:
https://issues.apache.org/jira/browse/MAHOUT-467

4) We allocate many local variables in a loop (needs confirmation).

In class DistributedUncenteredZeroAssumingCosineVectorSimilarity:

  @Override
  public double weight(Vector v) {
    double length = 0.0;
    Iterator<Element> elemIterator = v.iterateNonZero();
    while (elemIterator.hasNext()) {
      double value = elemIterator.next().get();  // this one
      length += value * value;
    }
    return Math.sqrt(length);
  }

5) Maybe we need to control the size of the cooccurrences.
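Item 5 can be motivated with a quick count: for a user with k preferred items, CooccurrencesMapper emits one record per item pair, i.e. k*(k-1)/2 records, so map output grows quadratically in per-user history length. Capping k bounds that output. A back-of-the-envelope sketch (CooccurrencePairCount is illustrative; the item counts are hypothetical, not measured from this job):

```java
// Sketch: a user with k preferred items contributes k*(k-1)/2 co-occurring
// pairs, so per-user mapper output grows quadratically with history length.
public class CooccurrencePairCount {

  static long pairs(long k) {
    return k * (k - 1) / 2;  // number of unordered item pairs
  }

  public static void main(String[] args) {
    System.out.println(pairs(30));    // 435 pairs for a 30-item history
    System.out.println(pairs(1000)); // 499500 pairs for a 1000-item "power user"
  }
}
```

Pruning long user histories before the pairwise step (as item 5 suggests) caps this quadratic blow-up.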




[jira] Commented: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897930#action_12897930 ] 

Ted Dunning commented on MAHOUT-468:
------------------------------------

{quote}
1) We used SequenceFileOutputFormat,it cause job can only be run by n ( n= Hadoop node counts ) mappers or reducers concurrently.
{quote}
Sequence files are splittable.  This isn't the problem.
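Ted's point can be illustrated with simple arithmetic: since sequence files split on HDFS block boundaries, the number of map tasks tracks input size, not node count. A hedged sketch (assuming a 64 MB block size, the old Hadoop default; SplitCountSketch and the numbers are illustrative, not measured from this cluster):

```java
// Sketch: input splits scale with file size / block size, not node count,
// because sequence files are splittable. Assumes a 64 MB HDFS block size.
public class SplitCountSketch {

  static long splits(long fileBytes, long blockBytes) {
    return (fileBytes + blockBytes - 1) / blockBytes;  // ceiling division
  }

  public static void main(String[] args) {
    long pairwiseSimilarity = 19261L * 1024 * 1024;       // ~18.81 GB from the report
    long blockSize = 64L * 1024 * 1024;                   // assumed 64 MB blocks
    System.out.println(splits(pairwiseSimilarity, blockSize));  // 301 potential map tasks
  }
}
```

So a ~18.81 GB input yields hundreds of splits; if only one mapper runs, the limit is elsewhere (e.g. local mode or cluster configuration).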



[jira] Updated: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Hui Wen  updated MAHOUT-468:
--------------------------------

    Attachment: RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg

RowSimilarityJob-CooccurrencesMapper-SimilarityReducer



[jira] Commented: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897934#action_12897934 ] 

Sean Owen commented on MAHOUT-468:
----------------------------------

Yes, there is only positive effect to #4 and it should not be changed. There is no "allocation" of stack variables at runtime in Java. The alternative, to save the result of next() and then call get() twice is definitely slower.

What's left as the concrete issue here? MAHOUT-467 is separate. #1 should be OK. The redundant info in #2 doesn't seem like a big sin.



[jira] Commented: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897924#action_12897924 ] 

Ted Dunning commented on MAHOUT-468:
------------------------------------

4) allocate many local variable in loop (need confirm )

In Class DistributedUncenteredZeroAssumingCosineVectorSimilarity
{code}
@Override
public double weight(Vector v) {
  double length = 0.0;
  Iterator<Element> elemIterator = v.iterateNonZero();
  while (elemIterator.hasNext()) {
    double value = elemIterator.next().get();  // this one
    length += value * value;
  }
  return Math.sqrt(length);
}
{code}
In fact the "allocation" of value here will have only a positive effect.  In general, primitives allocated in a small scope will be handled by the JIT in about as optimal a fashion as is possible.  Even heap-allocated structures in a tight loop do not usually cause any measurable inefficiency (I have tested this recently).  In some cases, re-using a structure in a tight loop will be *slower* than allocating new objects because it will cause the reused structure to survive into the tenured generation while the multiple allocations will simply cause a small amount of churn in the eden space. 
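Ted's claim is easy to sanity-check: the two idioms below compute identical results, and the per-iteration declaration is just lexical scoping, not a runtime heap allocation. A minimal sketch (LocalVarScopeSketch is illustrative, not a rigorous benchmark):

```java
// Sketch: both idioms compute the same sum of squares; declaring the
// primitive inside the loop body is scoping, not an allocation the JIT
// must do extra work for.
public class LocalVarScopeSketch {

  static double insideLoop(double[] values) {
    double length = 0.0;
    for (int i = 0; i < values.length; i++) {
      double value = values[i];   // declared per iteration
      length += value * value;
    }
    return length;
  }

  static double hoisted(double[] values) {
    double length = 0.0;
    double value;                 // declared once, outside the loop
    for (int i = 0; i < values.length; i++) {
      value = values[i];
      length += value * value;
    }
    return length;
  }

  public static void main(String[] args) {
    double[] v = {1.0, 2.0, 3.0};
    System.out.println(insideLoop(v));  // 14.0
    System.out.println(hoisted(v));     // 14.0
  }
}
```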



[jira] Updated: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Hui Wen  updated MAHOUT-468:
--------------------------------

    Comment: was deleted

(was: RowSimilarityJob-CooccurrencesMapper-SimilarityReducer)



[jira] Commented: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898088#action_12898088 ] 

Han Hui Wen  commented on MAHOUT-468:
-------------------------------------

Please see 
https://issues.apache.org/jira/browse/MAHOUT-473



[jira] Commented: (MAHOUT-468) Performance of RowSimilarityJob is not good

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897826#action_12897826 ] 

Sebastian Schelter commented on MAHOUT-468:
-------------------------------------------

I think you're addressing this issue here: https://issues.apache.org/jira/browse/MAHOUT-460
