Posted to dev@mahout.apache.org by "Joshi, Shrinivas" <Sh...@amd.com> on 2012/07/07 00:18:06 UTC

Potential regression in ASFEmail KMeans clustering

Just wanted to find out whether this is known or expected behavior with Mahout trunk. We are noticing that the KMeans iteration jobs that are part of the ASFEmail sample take longer to execute than they do with the Mahout 0.6 release. Using the Mahout 0.6 release on our test cluster, these jobs/steps take no more than 6-7 minutes. However, with the trunk code that I checked out a few days back, they take anywhere between 25 and 50 minutes. Has anybody else seen something similar?

Thanks,
-Shrinivas

Re: Potential regression in ASFEmail KMeans clustering

Posted by Sean Owen <sr...@gmail.com>.
Oops, hit enter too early --

This changes

if (!foo) {
  doBar();
} else {
  doFoo();
}

to

if (foo) {
  doFoo();
} else {
  doBar();
}

(among other changes) for readability.

I imagine there could be a problem here but this isn't the change that did it.
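
To make the point above concrete, here is a minimal sketch of why inverting the condition and swapping the branches together leaves behavior unchanged. The class and method names are hypothetical and exist only for illustration; nothing here is taken from the Mahout source. The third method shows, purely for contrast, what flipping the condition alone would do.

```java
public class BranchInversionCheck {

    // Original shape: negated condition, "bar" branch first.
    static String negatedForm(boolean foo) {
        if (!foo) {
            return "doBar";
        } else {
            return "doFoo";
        }
    }

    // Rewritten shape: positive condition with the two branches swapped.
    static String positiveForm(boolean foo) {
        if (foo) {
            return "doFoo";
        } else {
            return "doBar";
        }
    }

    // Hypothetical contrast case: condition flipped, branches left in place.
    // This one does change behavior.
    static String conditionOnlyFlipped(boolean foo) {
        if (foo) {
            return "doBar";
        } else {
            return "doFoo";
        }
    }

    public static void main(String[] args) {
        for (boolean foo : new boolean[] {true, false}) {
            System.out.println("foo=" + foo
                    + " negated=" + negatedForm(foo)
                    + " positive=" + positiveForm(foo)
                    + " conditionOnly=" + conditionOnlyFlipped(foo));
        }
    }
}
```

For both values of foo the first two methods agree, while the third always picks the opposite branch.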


On Wed, Jul 11, 2012 at 9:33 AM, Sean Owen <sr...@gmail.com> wrote:
> I made the change on the line in question, but it can't be the problem
> since it did not change the functionality. To see that you have to
> look at the rest of the change. It is changing...
>
> if (! ) {
>   A;
> } else {
>   B
> }
>
> to
>
> if (foo) {
>   B
>

Re: Potential regression in ASFEmail KMeans clustering

Posted by Sean Owen <sr...@gmail.com>.
I made the change on the line in question, but it can't be the problem
since it did not change the functionality. To see that you have to
look at the rest of the change. It is changing...

if (! ) {
  A;
} else {
  B
}

to

if (foo) {
  B

On Wed, Jul 11, 2012 at 3:17 AM, Joshi, Shrinivas
<Sh...@amd.com> wrote:
> It looks like this regression is caused by incorrect processing of IDFs. The difference I noticed between current trunk and the Mahout 0.6 release in this part of the code appears to be in the core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java file.
>
> I am not sure whether the changes in this class were part of a valid change or were accidental. The following patch seems to address the regression that we are seeing. Execution time of the iteration jobs is now less than 3 minutes on our test cluster. Without this change we see them taking as much as 1 hr 20 mins.
>
> Let me know if I am missing something here.
>
> Index: core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
> ===================================================================
> --- core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java (revision 1359759)
> +++ core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java (working copy)
> @@ -268,7 +268,7 @@
>            ? DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER+"-toprune"
>            : DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER;
>
> -      if (processIdf) {
> +      if (!processIdf) {
>          DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>                                                          outputDir,
>                                                          tfDirName,
>
> Thanks,
> -Shrinivas

RE: Potential regression in ASFEmail KMeans clustering

Posted by "Joshi, Shrinivas" <Sh...@amd.com>.
It looks like this regression is caused by incorrect processing of IDFs. The difference I noticed between current trunk and the Mahout 0.6 release in this part of the code appears to be in the core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java file.

I am not sure whether the changes in this class were part of a valid change or were accidental. The following patch seems to address the regression that we are seeing. Execution time of the iteration jobs is now less than 3 minutes on our test cluster. Without this change we see them taking as much as 1 hr 20 mins.

Let me know if I am missing something here.

Index: core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
===================================================================
--- core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java (revision 1359759)
+++ core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java (working copy)
@@ -268,7 +268,7 @@
           ? DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER+"-toprune"
           : DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER;

-      if (processIdf) {
+      if (!processIdf) {
         DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                                                         outputDir,
                                                         tfDirName,

Thanks,
-Shrinivas
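
For readers unfamiliar with the term, "IDF" in the message above refers to inverse document frequency weighting of term vectors. The following is a minimal sketch of the textbook tf * ln(N / df) weighting, assuming a tiny in-memory corpus. It is not Mahout's implementation, and the class, method names, and sample documents are all hypothetical. It only illustrates how IDF weighting changes vector entries relative to raw term frequency.

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {

    // Raw term frequency: number of times the term occurs in one document.
    static int tf(List<String> doc, String term) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Document frequency: number of documents that contain the term at least once.
    static int df(List<List<String>> corpus, String term) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Textbook weight tf * ln(N / df); returns 0 for terms absent from the corpus.
    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        int docFreq = df(corpus, term);
        if (docFreq == 0) {
            return 0.0;
        }
        return tf(doc, term) * Math.log((double) corpus.size() / docFreq);
    }

    // Hypothetical three-document corpus used only for this illustration.
    static List<List<String>> sampleCorpus() {
        return Arrays.asList(
                Arrays.asList("the", "kmeans", "job", "kmeans"),
                Arrays.asList("the", "mail", "list"),
                Arrays.asList("the", "patch"));
    }

    public static void main(String[] args) {
        List<List<String>> corpus = sampleCorpus();
        List<String> first = corpus.get(0);
        for (String term : new String[] {"the", "kmeans"}) {
            System.out.println(term + ": tf=" + tf(first, term)
                    + " tfidf=" + tfIdf(first, corpus, term));
        }
    }
}
```

In this sketch a term that appears in every document ("the") gets weight 0 under IDF even though its raw term frequency is nonzero, which is the kind of difference between TF and TF-IDF vectors that the processIdf branch selects between.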
