Posted to user@mahout.apache.org by si...@bt.com on 2013/03/09 16:44:50 UTC

Odd clustering fail

Hi there, 

I am doing a fairly silly experiment to measure Hadoop performance. As part of this I have extracted emails from the Enron database, and I am clustering them using a proprietary method for clustering short messages (i.e. tweets, emails, SMSes) and benchmarking clusters in various configurations. 

As part of this I have been benchmarking a single processing machine (my new laptop): an HP EliteBook with 32MB RAM, SSDs, nice processors, etc. The point is that when explaining to people that we need Hadoop, I can show them that a laptop is really, really useless and likely to remain so. (I know this is obvious; come and work in a corporate and find out what else you have to do to earn a living! Then tell me that I am silly!) 

Anyhoo... I have seen reasonable behaviours from the algorithms I have built (i.e. for very small data MapReduce puts an overhead on the processing, but once you get reasonably large the parallelism wins), but when I try Mahout's k-means I get an odd behaviour. 

When I get to ~175k individual files / ~175MB of input data I get an exception: 

Exception in thread "main" java.lang.IllegalStateException: Job failed!
	at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
	at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
	at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)

Is this because I am entirely inept and have missed something, or is it a limitation of Mahout sequence files, given that they are not aimed at loads of short messages that really can't be clustered anyway because they carry no information? 

Simon



----
Dr. Simon Thompson

RE: Odd clustering fail

Posted by si...@bt.com.
Hi Ted & Dan, 

thanks for coming back to me. I figured it out - I was being dumb! 

The short version is that I had a large file (a database results dump) that I was splitting with awk in a directory called enron-in.

On this occasion I had forgotten to remove it before running "mahout seqdirectory...". Because there were/are 100k+ files in the directory after the split, I didn't see my mistake; once I realised and removed it, all was well. I suspected this might be the problem because I had previously hit the same error due to _SUCCESS files in a job's output, and had to turn off writing _SUCCESS in the Mahout config. 
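For anyone hitting the same wall, the cleanup above can be sketched as a quick sanity check on the seqdirectory input folder. The directory name, the 10 MB threshold, and the helper name are all assumptions for illustration; adjust them for your own data:

```shell
#!/bin/sh
# Sketch: flag files in the input directory that probably should not be
# fed to `mahout seqdirectory` -- anything suspiciously large (a likely
# leftover raw dump) or a stray Hadoop _SUCCESS marker. The size
# threshold (10 MB) is an assumption; tune it to your message sizes.
check_input_dir() {
  find "$1" -maxdepth 1 -type f \( -size +10M -o -name '_SUCCESS' \) -print
}

# Demo on a throwaway directory standing in for enron-in.
mkdir -p /tmp/enron-in-demo
: > /tmp/enron-in-demo/mail1.txt                       # a normal small message
touch /tmp/enron-in-demo/_SUCCESS                      # stray Hadoop marker
dd if=/dev/zero of=/tmp/enron-in-demo/dump.csv bs=1048576 count=11 2>/dev/null
check_input_dir /tmp/enron-in-demo                     # lists dump.csv and _SUCCESS
```

Running this before the Mahout job makes a forgotten dump file or marker jump out even when the directory holds 100k+ split files.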

Hope this helps people in the future. 

Simon

Dr. Simon Thompson
_________
From: Ted Dunning [ted.dunning@gmail.com]
Sent: 10 March 2013 17:54
To: user@mahout.apache.org
Subject: Re: Odd clustering fail


Re: Odd clustering fail

Posted by Ted Dunning <te...@gmail.com>.
Simon,

The code that caused this error is this:

    boolean succeeded = job.waitForCompletion(true);
    if (!succeeded) {
      throw new IllegalStateException("Job failed!");
    }


The reason that this is likely happening is that one of the tasks in your
program failed.  You probably need to look at the job tracker status page
to figure out which task, and then look at the logs for that task to find
the problem.
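To dig that underlying error out without clicking through the UI, the task attempt logs can also be grepped directly. The userlogs layout below is an assumption (it varies across Hadoop versions and installs), so treat this as a sketch:

```shell
#!/bin/sh
# Sketch: scan Hadoop task-attempt logs for the stack trace hiding
# behind a generic "Job failed!". The per-attempt directory layout
# shown here is an assumption; Hadoop 1.x typically keeps these under
# $HADOOP_LOG_DIR/userlogs/.
scan_task_logs() {
  # list every log file under the given tree that mentions an exception
  grep -rl "Exception" "$1" 2>/dev/null
}

# Demo against a throwaway log tree with one failed map attempt.
mkdir -p /tmp/userlogs-demo/attempt_201303101754_0001_m_000000_0
echo "java.io.IOException: Spill failed" \
  > /tmp/userlogs-demo/attempt_201303101754_0001_m_000000_0/syslog
scan_task_logs /tmp/userlogs-demo
```

Once the offending attempt's log file is identified, its first stack trace is usually the real cause that the driver's IllegalStateException hides.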

Also it is a bit confusing when you say:

> As part of this I have been benchmarking a single processing machine (my
> new laptop): an HP EliteBook with 32MB RAM, SSDs, nice processors, etc.
> The point is that when explaining to people that we need Hadoop, I can
> show them that a laptop is really, really useless and likely to remain
> so. (I know this is obvious; come and work in a corporate and find out
> what else you have to do to earn a living! Then tell me that I am silly!)


Did you actually mean 32GB?

If so, this isn't a useless machine at all.  It is quite possible that it
isn't up to all different tasks, but it should be plenty fast enough to get
lots of work done.

So as a point of corporate culture, you might get more mileage with your
arguments if you don't start with "this laptop is useless" on the way to
trying to prove "we need Hadoop".  In the first place, the first comment is
just silly, and in the second place, the logical inference doesn't necessarily
follow.  You might be more successful by posing an argument of the form
"this laptop can do X with the following limitations" AND "a Hadoop cluster
of the following modest size can do Y with much less severe limitations"
AND "this is the business benefit of doing Y instead of X".

The real key is the third part of the argument.  If you can't make that
part of the argument, then none of the rest matters.  And, no, that part of
the argument is not obvious ... the business is currently working so what
is obvious is that the business can function without your suggested
approach.

Good luck.  You should try the new streaming k-means stuff that Dan is
working on as well.

On Sun, Mar 10, 2013 at 12:09 PM, Dan Filimon <da...@gmail.com> wrote:


Re: Odd clustering fail

Posted by Dan Filimon <da...@gmail.com>.
Hi Simon,

That looks like an error from the seq2sparse job you're using to
vectorize the documents. I think it's very surprising to get an error
when vectorizing, but others more experienced than me should probably
comment. :)

The line numbers don't match what I have in my version of Mahout (a
forked version of trunk).

If I'm not mistaken there should be an "inner" exception thrown by a
mapper or reducer that tells us more. Can you please look through the
error log and see if there's anything else?

As a side note, I'm clustering the 20 newsgroups data set (~20K
documents at ~20MB in total) and it's working fine.

Thanks!
Dan
