Posted to user@mahout.apache.org by Jake Mannix <ja...@gmail.com> on 2011/05/08 18:08:48 UTC

The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Running on the cluster, I'm hit with this again when using a freshly built
distribution (with mvn package -Prelease).  What is the solution we always
give people to deal with this?

(just running things like "./bin/mahout svd -i <input> -o <output> etc... ")

  -jake

Re: Understanding log-likelihood

Posted by Ted Dunning <te...@gmail.com>.
Try this:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

On Sun, May 8, 2011 at 5:14 PM, Thomas Söhngen <th...@beluto.com> wrote:

> Hello,
>
> I struggle to understand the log-likelihood function. I would highly
> welcome a simple example of how it is calculated, especially in Mahout.
>
> Thanks in advance,
> Thomas
>

Re: Understanding log-likelihood

Posted by Sean Owen <sr...@gmail.com>.
I can try to explain one understanding of the meaning, though it is
not really the intuitive explanation of the formulation in Mahout,
rather a somewhat different one I originally used. And even that one I
only understand about 80%.

Two users are similar when they rate or are associated to many of the
same items. However, a certain overlap may or may not be meaningful --
it could be due to chance, or due to the fact that we have similar
tastes. For example if you and I have rated 100 items each, and 50
overlap, we're probably similar. But if we've each rated 1000 and
overlap in only 50, maybe we're not.

The log-likelihood metric is just trying to formally quantify how
unlikely it is that our overlap is due to chance. The less likely, the
more similar we are.

So it is comparing two likelihoods, and just looking at their ratio.
The numerator likelihood is the null hypothesis: we're not similar and
overlap is due to chance. The denominator is the likelihood that it's
not at all due to chance -- that the overlap is perfectly explained
because our tastes are similar, and is exactly what you'd expect given
that.

When the numerator is relatively small, the null hypothesis is
relatively unlikely, so we are similar.

The formulation then typically takes -2.0 * log(likelihood ratio); this
is by convention, and it makes the result a bit more useful in two
ways. One, more similarity will equal a higher log-likelihood, which is
perhaps more intuitive than the likelihood ratio, which is lower when
similarity is higher. But the real reason is that the log-likelihood
value then follows a chi-squared distribution, and the result can be
used to actually figure a probability that the users are similar or
not. (We don't use it that way in Mahout though.)

And Ted's formulation, which is also right and quite tidy and the one
used in the project, is based on Shannon entropy. I understand it, I
believe, but would have to think more about an intuitive explanation.

It is, similarly, trying to figure out whether the co-occurrences are
"unusually" frequent by asking whether there is any additional
information to be gained by looking at user 1 and user 2's preferences
separately versus everything at once. If there is, then there is
something specially related about user 1 and user 2 and they're
similar.
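
To make Ted's entropy formulation concrete, here's a minimal sketch of
the 2x2 computation (my own illustration of the formula discussed in
this thread, not Mahout's actual LogLikelihood class; the 10,000-item
universe in main() is invented):

public final class LlrSketch {

  // k11 = items rated by both users, k12/k21 = items rated by exactly
  // one of them, k22 = items rated by neither.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    long n = k11 + k12 + k21 + k22;
    double matrixH = h(n, k11, k12, k21, k22);   // H(k)
    double rowH = h(n, k11 + k12, k21 + k22);    // H(rowSums(k))
    double colH = h(n, k11 + k21, k12 + k22);    // H(colSums(k))
    return 2.0 * n * (matrixH - rowH - colH);    // 2 sum(k) (H(k) - H(rows) - H(cols))
  }

  // H(x) = sum_i (x_i / N) * log(x_i / N), i.e. the negative of Shannon
  // entropy, with 0 log 0 taken as 0.
  static double h(long n, long... counts) {
    double sum = 0.0;
    for (long k : counts) {
      if (k > 0) {
        double p = (double) k / n;
        sum += p * Math.log(p);
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    // 100 ratings each, 50 in common: far more overlap than chance
    // predicts (about 1 item), so the LLR is large and the users look similar.
    System.out.println(logLikelihoodRatio(50, 50, 50, 9850));
    // 1000 ratings each, 100 in common: exactly the chance expectation
    // (1000 * 1000 / 10000), so the LLR is ~0.
    System.out.println(logLikelihoodRatio(100, 900, 900, 8100));
  }
}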


On Mon, May 9, 2011 at 3:09 AM, Thomas Söhngen <th...@beluto.com> wrote:
> Thank you for the explanation. I can understand the calculations now, but I
> still don't get the meaning. I think I'll try to sleep a night over it and
> try again tomorrow.
>
> Best regards,
> Thomas
>
> Am 09.05.2011 03:42, schrieb Ted Dunning:
>>
>> In this notation, k is assumed to be a matrix.  k_11 is the element in the
>> first row and first column.
>>
>> I used k to sound like count.
>>
>> The notation that you quote is R syntax.  rowSums is a function that
>> computes the row-wise sums of the argument k.  H is a function defined
>> elsewhere.
>>
>> On Sun, May 8, 2011 at 6:33 PM, Thomas Söhngen<th...@beluto.com>  wrote:
>>
>>> Thank you for the blog post and showing me the G-test formula.
>>>
>>> After going through your blog post, I still have some open questions: You
>>> introduce k_11 to k_22, but I don't understand what "k" itself actually
> stands for in your formula and how the sums are defined: LLR = 2 sum(k)
>>> (H(k) - H(rowSums(k)) - H(colSums(k)))
>>>
>>> Am 09.05.2011 02:46, schrieb Ted Dunning:
>>>
>>>> My guess is that the OP was asking about the generalized log-likelihood
>>>> ratio test used in the Mahout recommendation framework.
>>>>
>>>> That is a bit different from what you describe in that it is the log of
>>>> the ratio of two maximum likelihoods.
>>>>
>>>> See http://en.wikipedia.org/wiki/G-test for a definition of the test
>>>> used in Mahout.
>>>>
>>>> On Sun, May 8, 2011 at 5:43 PM, Jeremy Lewi<je...@lewi.us>   wrote:
>>>>
>>>>  Thomas,
>>>>>
>>>>> Are you asking a general question about log-likelihood or a specific
>>>>> implementation usage in Mahout?
>>>>>
>>>>> In general the likelihood is just a number, between 0 and 1 which
>>>>> measures the probability of observing some data under some
>>>>> distribution.
>>>>>
>>>>>
>>>>>
>

Re: Understanding log-likelihood

Posted by Ted Dunning <te...@gmail.com>.
Well, you have good company.

Meaning eludes us all in some sense.

On Sun, May 8, 2011 at 7:09 PM, Thomas Söhngen <th...@beluto.com> wrote:

> I can understand the calculations now, but I still don't get the meaning.

Re: Understanding log-likelihood

Posted by Thomas Söhngen <th...@beluto.com>.
Thank you for the explanation. I can understand the calculations now, 
but I still don't get the meaning. I think I'll try to sleep a night 
over it and try again tomorrow.

Best regards,
Thomas

Am 09.05.2011 03:42, schrieb Ted Dunning:
> In this notation, k is assumed to be a matrix.  k_11 is the element in the
> first row and first column.
>
> I used k to sound like count.
>
> The notation that you quote is R syntax.  rowSums is a function that
> computes the row-wise sums of the argument k.  H is a function defined
> elsewhere.
>
> On Sun, May 8, 2011 at 6:33 PM, Thomas Söhngen<th...@beluto.com>  wrote:
>
>> Thank you for the blog post and showing me the G-test formula.
>>
>> After going through your blog post, I still have some open questions: You
>> introduce k_11 to k_22, but I don't understand what "k" itself actually
>> stands for in your formula and how the sums are defined: LLR = 2 sum(k)
>> (H(k) - H(rowSums(k)) - H(colSums(k)))
>>
>> Am 09.05.2011 02:46, schrieb Ted Dunning:
>>
>>> My guess is that the OP was asking about the generalized log-likelihood
>>> ratio test used in the Mahout recommendation framework.
>>>
>>> That is a bit different from what you describe in that it is the log of
>>> the ratio of two maximum likelihoods.
>>>
>>> See http://en.wikipedia.org/wiki/G-test for a definition of the test
>>> used in Mahout.
>>>
>>> On Sun, May 8, 2011 at 5:43 PM, Jeremy Lewi<je...@lewi.us>   wrote:
>>>
>>>   Thomas,
>>>> Are you asking a general question about log-likelihood or a specific
>>>> implementation usage in Mahout?
>>>>
>>>> In general the likelihood is just a number, between 0 and 1 which
>>>> measures the probability of observing some data under some distribution.
>>>>
>>>>
>>>>

Re: Understanding log-likelihood

Posted by Ted Dunning <te...@gmail.com>.
In this notation, k is assumed to be a matrix.  k_11 is the element in the
first row and first column.

I used k to sound like count.

The notation that you quote is R syntax.  rowSums is a function that
computes the row-wise sums of the argument k.  H is a function defined
elsewhere.
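
A literal Java transcription of those R pieces may help (a sketch with
invented counts, and H as I read it from the blog post; this is not
code from Mahout):

public final class NotationSketch {
  public static void main(String[] args) {
    // k is a 2x2 count matrix, flattened here as k11, k12, k21, k22:
    double[] k = {10, 20, 30, 940};
    double[] rowSums = {k[0] + k[1], k[2] + k[3]};   // rowSums(k)
    double[] colSums = {k[0] + k[2], k[1] + k[3]};   // colSums(k)
    double n = k[0] + k[1] + k[2] + k[3];            // sum(k)
    // LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
    System.out.println(2.0 * n * (h(n, k) - h(n, rowSums) - h(n, colSums)));
  }

  // H(x) = sum_i (x_i / N) * log(x_i / N), with 0 log 0 taken as 0
  static double h(double n, double[] xs) {
    double sum = 0.0;
    for (double x : xs) {
      if (x > 0) {
        sum += (x / n) * Math.log(x / n);
      }
    }
    return sum;
  }
}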

On Sun, May 8, 2011 at 6:33 PM, Thomas Söhngen <th...@beluto.com> wrote:

> Thank you for the blog post and showing me the G-test formula.
>
> After going through your blog post, I still have some open questions: You
> introduce k_11 to k_22, but I don't understand what "k" itself actually
> stands for in your formula and how the sums are defined: LLR = 2 sum(k)
> (H(k) - H(rowSums(k)) - H(colSums(k)))
>
> Am 09.05.2011 02:46, schrieb Ted Dunning:
>
>> My guess is that the OP was asking about the generalized log-likelihood
>> ratio test used in the Mahout recommendation framework.
>>
>> That is a bit different from what you describe in that it is the log of
>> the ratio of two maximum likelihoods.
>>
>> See http://en.wikipedia.org/wiki/G-test for a definition of the test
>> used in Mahout.
>>
>> On Sun, May 8, 2011 at 5:43 PM, Jeremy Lewi<je...@lewi.us>  wrote:
>>
>>  Thomas,
>>>
>>> Are you asking a general question about log-likelihood or a specific
>>> implementation usage in Mahout?
>>>
>>> In general the likelihood is just a number, between 0 and 1 which
>>> measures the probability of observing some data under some distribution.
>>>
>>>
>>>

Re: Understanding log-likelihood

Posted by Thomas Söhngen <th...@beluto.com>.
Thank you for the blog post and showing me the G-test formula.

After going through your blog post, I still have some open questions: 
You introduce k_11 to k_22, but I don't understand what "k" itself 
actually stands for in your formula and how the sums are defined: LLR =
2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

Am 09.05.2011 02:46, schrieb Ted Dunning:
> My guess is that the OP was asking about the generalized log-likelihood
> ratio test used in the Mahout recommendation framework.
>
> That is a bit different from what you describe in that it is the log of the
> ratio of two maximum likelihoods.
>
> See http://en.wikipedia.org/wiki/G-test for a definition of the test used in
> Mahout.
>
> On Sun, May 8, 2011 at 5:43 PM, Jeremy Lewi<je...@lewi.us>  wrote:
>
>> Thomas,
>>
>> Are you asking a general question about log-likelihood or a specific
>> implementation usage in Mahout?
>>
>> In general the likelihood is just a number, between 0 and 1 which
>> measures the probability of observing some data under some distribution.
>>
>>

Re: Understanding log-likelihood

Posted by Jeremy Lewi <je...@lewi.us>.
I figured it was probably more mahout specific and not general stats.

J
On Sun, 2011-05-08 at 17:46 -0700, Ted Dunning wrote:
> My guess is that the OP was asking about the generalized log-likelihood
> ratio test used in the Mahout recommendation framework.
> 
> That is a bit different from what you describe in that it is the log of the
> ratio of two maximum likelihoods.
> 
> See http://en.wikipedia.org/wiki/G-test for a definition of the test used in
> Mahout.
> 
> On Sun, May 8, 2011 at 5:43 PM, Jeremy Lewi <je...@lewi.us> wrote:
> 
> > Thomas,
> >
> > Are you asking a general question about log-likelihood or a specific
> > implementation usage in Mahout?
> >
> > In general the likelihood is just a number, between 0 and 1 which
> > measures the probability of observing some data under some distribution.
> >
> >


Re: Understanding log-likelihood

Posted by Ted Dunning <te...@gmail.com>.
My guess is that the OP was asking about the generalized log-likelihood
ratio test used in the Mahout recommendation framework.

That is a bit different from what you describe in that it is the log of the
ratio of two maximum likelihoods.

See http://en.wikipedia.org/wiki/G-test for a definition of the test used in
Mahout.
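
For reference, the statistic defined on that page is (O being the
observed cell counts, E the counts expected under independence):

  G = 2 * sum_ij O_ij * ln(O_ij / E_ij),  with E_ij = (row total_i) * (column total_j) / N

which is a -2 log(likelihood ratio) with independence as the null
hypothesis.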

On Sun, May 8, 2011 at 5:43 PM, Jeremy Lewi <je...@lewi.us> wrote:

> Thomas,
>
> Are you asking a general question about log-likelihood or a specific
> implementation usage in Mahout?
>
> In general the likelihood is just a number, between 0 and 1 which
> measures the probability of observing some data under some distribution.
>
>

Re: Understanding log-likelihood

Posted by Jeremy Lewi <je...@lewi.us>.
Thomas,

Are you asking a general question about log-likelihood or a specific
implementation usage in Mahout?

In general the likelihood is just a number between 0 and 1, which
measures the probability of observing some data under some distribution.

So as a simple example, consider a coin toss. The probability of
observing a heads is .5 and the probability of observing a tails is .5.

So now suppose you observe a coin toss, and the outcome is a heads. So
now we can ask how likely this outcome was under the assumption that the
coin was fair. Well, the likelihood in this case is just .5, because the
coin is fair.

So now suppose you observe two coin tosses, and the outcome is both
heads. How likely is this outcome? Since the tosses are independent, the
probability of getting two heads is simply the product of the
probabilities of getting a heads on each flip; so
p(two heads | fair coin) = .5 * .5 = .25

We can also think of this as a counting problem and consider all
possible outcomes for flipping two coins. These are {HH,TH,HT,TT}.
Since the coin is fair all of these outcomes have equal probability of
1/4 or .25.

The probabilities we computed above are the likelihoods. The
log-likelihood is just the result of taking the log of them.
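
As a toy illustration of that arithmetic (my sketch, nothing
Mahout-specific):

public final class CoinTossSketch {
  public static void main(String[] args) {
    double pHeads = 0.5;                          // fair coin
    int n = 2;                                    // we observed two heads
    double likelihood = Math.pow(pHeads, n);      // .5 * .5 = .25
    double logLikelihood = Math.log(likelihood);  // ln(.25) ≈ -1.386
    System.out.println(likelihood + " " + logLikelihood);
  }
}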

This is a pretty meager explanation so feel free to ask for
clarification.

J


On Mon, 2011-05-09 at 02:14 +0200, Thomas Söhngen wrote:
> Hello,
> 
> I struggle to understand the log-likelihood function. I would highly 
> welcome a simple example of how it is calculated, especially in Mahout.
> 
> Thanks in advance,
> Thomas


Understanding log-likelihood

Posted by Thomas Söhngen <th...@beluto.com>.
Hello,

I struggle to understand the log-likelihood function. I would highly 
welcome a simple example of how it is calculated, especially in Mahout.

Thanks in advance,
Thomas

Re: Why does SIMILARITY_EUCLIDEAN_DISTANCE only generate outputs with a similarity score of "1" for binary input?

Posted by Sean Owen <sr...@gmail.com>.
No, not in the distributed version. I think this would be a bad idea
performance-wise. Suddenly a sparse user-item matrix becomes dense --
a million data points can become a trillion.

However I think you can sort of get what you want by implementing a
variant on the Euclidean implementation that would "pretend" that
these values were filled in.

Say two items overlap in A users (both are "1" for those A users), and
then there are B other users that exist for one or the other item but
not both. The Euclidean distance between them is sqrt(B). If I recall
correctly from what Sebastian did, the "weight" arguments you get in
doComputeResult() are the occurrences of each item. And "A" is just
the number of cooccurrences you get in this method. So B = weightA +
weightB - 2 * (# of cooccurrences), and the Euclidean distance as you
want it is sqrt(B).

To get a similarity measure, you could use 1 / (1 + sqrt(B)). We use a
different formulation, A / (1 + sqrt(B)) which avoids penalizing two
vectors which have a lot of items in common, as they'd otherwise have
more chance of being farther away.

I'll point out that the TanimotoCoefficientSimilarity implementation,
which is also available to you, computes A / (A+B). Not the same, but
similar, so may be something you could use.

But really, log-likelihood is a good default choice.


On Sun, May 8, 2011 at 7:23 PM, Thomas Söhngen <th...@beluto.com> wrote:
> Is there a way to tell Mahout to "fill up" the user-item matrix with zeros
> when no rating is given for a user/item combination? I assume distance would
> become meaningful again then.
>
> Do you have any suggestions for scientific sources helping to choose an
> appropriate similarity function?
>
> Regards
> Thomas
>
>
> Am 08.05.2011 19:57, schrieb Sean Owen:
>>
>> All preferences are "1" in your world. Therefore user vectors are
>> always like (1,1,...,1). The distance between any two is 0, and the
>> similarity is 1. This metric is not appropriate for binary data. The
>> closest thing to what I think you want is the
>> TanimotoCoefficientSimilarity, but also try LogLikelihoodSimilarity.
>>
>> Yes, if you have a range of ratings, not just 1, it becomes meaningful
>> again to look at distance as a similarity metric.
>>
>> Sean
>>
>> On Sun, May 8, 2011 at 5:37 PM, Thomas Söhngen<th...@beluto.com>  wrote:
>>>
>>> Hello everyone,
>>>
>>> I am calculating similar items with the SIMILARITY_EUCLIDEAN_DISTANCE
>>> class. My input is binary data, users clicking a like button. The output
>>> only generates similarities with a similarity score of "1". It doesn't
>>> mark all items as similar to each other, but for the item pairs where it
>>> does find a similarity, the output is always "1". Why is this?
>>>
>>> I don't have the problem when I also add "dislike" information, with
>>> input lines "item_id,user_id,1" for a Like interaction and
>>> "item_id,user_id,-1" for dislikes. The similarity lies between 0 and 1
>>> then.
>>>
>>> Regards and thanks for suggestions,
>>> Thomas
>>>
>

Re: Why does SIMILARITY_EUCLIDEAN_DISTANCE only generate outputs with a similarity score of "1" for binary input?

Posted by Thomas Söhngen <th...@beluto.com>.
Is there a way to tell Mahout to "fill up" the user-item matrix with
zeros when no rating is given for a user/item combination? I assume
distance would become meaningful again then.

Do you have any suggestions for scientific sources helping to choose an
appropriate similarity function?

Regards
Thomas


Am 08.05.2011 19:57, schrieb Sean Owen:
> All preferences are "1" in your world. Therefore user vectors are
> always like (1,1,...,1). The distance between any two is 0, and the
> similarity is 1. This metric is not appropriate for binary data. The
> closest thing to what I think you want is the
> TanimotoCoefficientSimilarity, but also try LogLikelihoodSimilarity.
>
> Yes, if you have a range of ratings, not just 1, it becomes meaningful
> again to look at distance as a similarity metric.
>
> Sean
>
> On Sun, May 8, 2011 at 5:37 PM, Thomas Söhngen<th...@beluto.com>  wrote:
>> Hello everyone,
>>
>> I am calculating similar items with the SIMILARITY_EUCLIDEAN_DISTANCE
>> class. My input is binary data, users clicking a like button. The output
>> only generates similarities with a similarity score of "1". It doesn't
>> mark all items as similar to each other, but for the item pairs where it
>> does find a similarity, the output is always "1". Why is this?
>>
>> I don't have the problem when I also add "dislike" information, with
>> input lines "item_id,user_id,1" for a Like interaction and
>> "item_id,user_id,-1" for dislikes. The similarity lies between 0 and 1 then.
>>
>> Regards and thanks for suggestions,
>> Thomas
>>

Re: Why does SIMILARITY_EUCLIDEAN_DISTANCE only generate outputs with a similarity score of "1" for binary input?

Posted by Sean Owen <sr...@gmail.com>.
All preferences are "1" in your world. Therefore user vectors are
always like (1,1,...,1). The distance between any two is 0, and the
similarity is 1. This metric is not appropriate for binary data. The
closest thing to what I think you want is the
TanimotoCoefficientSimilarity, but also try LogLikelihoodSimilarity.
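
To see the collapse numerically, here is a toy sketch (the
1 / (1 + distance) mapping is the usual distance-to-similarity trick,
not necessarily the exact internals):

public final class AllOnesSketch {
  public static void main(String[] args) {
    double[] u = {1, 1, 1};           // every preference is "1"
    double[] v = {1, 1, 1};
    double squared = 0.0;
    for (int i = 0; i < u.length; i++) {
      double d = u[i] - v[i];         // always 0 for identical vectors
      squared += d * d;
    }
    System.out.println(1.0 / (1.0 + Math.sqrt(squared)));  // prints 1.0
  }
}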

Yes, if you have a range of ratings, not just 1, it becomes meaningful
again to look at distance as a similarity metric.

Sean

On Sun, May 8, 2011 at 5:37 PM, Thomas Söhngen <th...@beluto.com> wrote:
> Hello everyone,
>
> I am calculating similar items with the SIMILARITY_EUCLIDEAN_DISTANCE
> class. My input is binary data, users clicking a like button. The output
> only generates similarities with a similarity score of "1". It doesn't
> mark all items as similar to each other, but for the item pairs where it
> does find a similarity, the output is always "1". Why is this?
>
> I don't have the problem when I also add "dislike" information, with
> input lines "item_id,user_id,1" for a Like interaction and
> "item_id,user_id,-1" for dislikes. The similarity lies between 0 and 1 then.
>
> Regards and thanks for suggestions,
> Thomas
>

Why does SIMILARITY_EUCLIDEAN_DISTANCE only generate outputs with a similarity score of "1" for binary input?

Posted by Thomas Söhngen <th...@beluto.com>.
Hello everyone,

I am calculating similar items with the SIMILARITY_EUCLIDEAN_DISTANCE
class. My input is binary data, users clicking a like button. The output
only generates similarities with a similarity score of "1". It doesn't
mark all items as similar to each other, but for the item pairs where it
does find a similarity, the output is always "1". Why is this?

I don't have the problem when I also add "dislike" information, with
input lines "item_id,user_id,1" for a Like interaction and 
"item_id,user_id,-1" for dislikes. The similarity lies between 0 and 1 then.

Regards and thanks for suggestions,
Thomas

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
Well, of course. We made ours work by carefully tuning our use of the
assembly plugin to make sure that all the mahout jobs ended up in the
'unpacked' part of the circus, and Math and other dependencies ended
up jarred up in lib.

On Sun, May 8, 2011 at 7:01 PM, Sean Owen <sr...@gmail.com> wrote:
> Yea that's how I use it too ... but in my own larger structures I do
> repackage the .class files into the top level of the .jar to get it to
> work. I ran into the same thing as Jake. If it's not just me, there
> ought to be a solution.
>
> What if I removed the <unpack> stanzas in job.xml -- what would go
> wrong? Meaning, as it fixes the issue, if it doesn't do harm, let's
> make that change.
>
> On Sun, May 8, 2011 at 11:58 PM, Benson Margulies <bi...@gmail.com> wrote:
>> I'm not clear on how people manage to make the stock build fail. My
>> experience has been in trying to build Mahout jobs into larger
>> structures.
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
Yea that's how I use it too ... but in my own larger structures I do
repackage the .class files into the top level of the .jar to get it to
work. I ran into the same thing as Jake. If it's not just me, there
ought to be a solution.

What if I removed the <unpack> stanzas in job.xml -- what would go
wrong? Meaning, as it fixes the issue, if it doesn't do harm, let's
make that change.

On Sun, May 8, 2011 at 11:58 PM, Benson Margulies <bi...@gmail.com> wrote:
> I'm not clear on how people manage to make the stock build fail. My
> experience has been in trying to build Mahout jobs into larger
> structures.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
I'm not clear on how people manage to make the stock build fail. My
experience has been in trying to build Mahout jobs into larger
structures.

On Sun, May 8, 2011 at 6:44 PM, Sean Owen <sr...@gmail.com> wrote:
> It definitely works for me to package into one jar. Is this merely
> "icky" or does it not work for another reason?
> Yes I'm not suggesting we make users tweak the Maven build, but that
> we make this tweak ourselves. It's just removing the overriding of
> "unpack" behavior in job.xml files that I mean.
>
> On Sun, May 8, 2011 at 11:36 PM, Benson Margulies <bi...@gmail.com> wrote:
>> There isn't a good solution for 0.5.
>>
>> The code that calls setJarByClass has to pass a class that is NOT in
>> the lib directory, but rather in the unpacked classes. It's really
>> easy to build a hadoop job with Mahout that violates that rule due to
>> all the static methods that create jobs.
>>
>> We seem to have a consensus to rework all the jobs as beans so that
>> this can be wrestled into control.
>>
>>
>>
>> On Sun, May 8, 2011 at 6:16 PM, Jake Mannix <ja...@gmail.com> wrote:
>>> On Sun, May 8, 2011 at 2:58 PM, Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> If I recall the last discussion on this correctly --
>>>>
>>>> No you don't want to put anything in Hadoop's lib/ directory. Even if
>>>> you can, that's not the "right" way.
>>>> You want to use the job file indeed, which should contain all dependencies.
>>>> However, it packages dependencies as jars-in-the-jar, which doesn't
>>>> work for Hadoop.
>>>>
>>>
>>> I thought that hadoop was totally fine with jars inside of the jar, if
>>> they're
>>> in the lib directory?
>>>
>>>
>>>> I think if you modify the Maven build to just repackage all classes
>>>> into the main jar, it works. It works for me at least.
>>>>
>>>
>>> Clearly we're not expecting people to do this.  I wasn't even running with
>>> special new classes, it wasn't finding *Vector* - if this doesn't work on
>>> a real cluster, then most of our entire codebase (which requires
>>> mahout-math) doesn't work.
>>>
>>>  -jake
>>>
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Ken Krugler <kk...@transpac.com>.
Hi Sean,

On May 8, 2011, at 4:14pm, Sean Owen wrote:

> (The build error indicates you have some old class files somewhere --
> "clean" first)

Thanks, my bad - our internal ant-based builds always do an implicit clean before an install.

> Here, the lib/ directory definitely has the right dependencies and it
> still doesn't work. Benson investigated and found out it's just how
> Hadoop works in this case.

The last thing I saw from Benson was:

> Well, of course. We made ours work by carefully tuning our use of the
> assembly plugin to make sure that all the mahout jobs ended up in the
> 'unpacked' part of the circus, and Math and other dependencies ended
> up jarred up in lib.

This is the key point - that anything which is run as the main class for the job has to be on the regular classpath, not buried in an embedded jar.

Mahout is interesting in that it's a mix of support code, map/reduce tasks, and jobs.

If I had to solve this problem, I'd look at establishing a naming convention where all such job classes end with "Job".

Then I could (easily, in ant - I'm sure Maven would require some black belt moves) have a build where these classes don't go into the regular mahout-xxx.jar, but rather into a mahout-jobs.jar.

Now, if I am incorporating Mahout into one of my projects, for my custom job jar:

* Add all of the regular mahout-xxx.jar files that I need as sub-jars in my job jar.

* Add the mahout-jobs.jar as a regular dependency.

-- Ken

> On Mon, May 9, 2011 at 12:06 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>> I haven't been actively running Mahout for a while, but I do watch plenty of Hadoop students run into the ClassNotFoundException problem.
>> 
>> A standard Hadoop job jar has a lib subdir, which contains (as jars) all of the dependencies.
>> 
>> Typically the missing class problem is caused by somebody building their own Hadoop job jar, where they don't include a dependent jar (such as mahout-math) in the lib subdir.
>> 
>> Or somebody is trying to run a job locally, using the job jar directly, which then has to be unpacked as otherwise these embedded lib/*.jar classes aren't on the classpath.
>> 
>> But neither of those seem to match what Jake was doing:
>> 
>>> (just running things like "./bin/mahout svd -i <input> -o <output> etc... ")
>> 
>> 
>> I was going to try this out from trunk, but an svn up on trunk and then "mvn install" failed to pass one of the tests:
>> 
>>> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.025 sec <<< FAILURE!
>>> fullRankTall(org.apache.mahout.math.QRDecompositionTest)  Time elapsed: 0.014 sec  <<< ERROR!
>>> java.lang.NoSuchFieldError: MAX
>>>         at org.apache.mahout.math.QRDecompositionTest.assertEquals(QRDecompositionTest.java:122)
>>>         at org.apache.mahout.math.QRDecompositionTest.fullRankTall(QRDecompositionTest.java:38)
>> 
>> 
>> -- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
Ah right, that was the gotcha. Well, as it happens, there are no such
dependencies for Mahout at the moment. Those that really have to roll
their own .jar can use the Mahout .jar instead of the job file to
construct something like this.

In the meantime Jake is right that we have a pretty immediate problem
-- the job file doesn't work. I guess few people ever use it directly
but I can say I saw the same thing. And we have a working solution --
just use Maven's default packaging logic, which re-jars stuff.

I'll make a JIRA to say yea or nay to with a patch.

On Mon, May 9, 2011 at 5:55 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> One solution is to use the shade plugin or a similar technique to create one
> job jar with all dependencies in it. Which I deem a bad practice,
> because it unjars existing dependency jars, and it breaks things on
> occasion (e.g. if one of the dependencies is a signed jar, such as
> BouncyCastle). Projects get into using the shade plugin only to require
> a major retrofit when they hit a dependency like this.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Or did I misread the problem?

Another thought I had at one time is that maybe hadoop would honor the
manifest classpath the same way regular java -jar does, but I looked at
the code and I don't believe it did in 0.20.2.

On Sun, May 8, 2011 at 9:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> I never actually ran into this error. I guess my backend code never
> called anything outside the jar.
>
> But I do, or rather did, have similar problems with my project. I
> think I already voiced my opinion on this last time.
>
> One solution is to use the shade plugin or a similar technique to create one
> job jar with all dependencies in it. Which I deem a bad practice,
> because it unjars existing dependency jars, and it breaks things on
> occasion (e.g. if one of the dependencies is a signed jar, such as
> BouncyCastle). Projects get into using the shade plugin only to require
> a major retrofit when they hit a dependency like this.
>
> A better, more hadoop-like technique is to rework the standard driver class
> so that it explicitly tosses everything the assembly placed into lib onto
> the backend classpath using DistributedCache. Warning: this
> functionality is somewhat broken in standard 0.20.2 and requires
> a hack to work.
>
> -d
>
> On Sun, May 8, 2011 at 5:09 PM, Jake Mannix <ja...@gmail.com> wrote:
>> I haven't run a post 0.4 mahout build on a production hadoop cluster before
>> this, and I'm baffled that we have a job jar which simply -does not work-.
>> Is this really not me, and our stock examples jobs are broken on hadoop?
>>
>>  -jake
>>
>> On May 8, 2011 4:14 PM, "Sean Owen" <sr...@gmail.com> wrote:
>>
>> (The build error indicates you have some old class files somewhere --
>> "clean" first)
>>
>> Here, the lib/ directory definitely has the right dependencies and it
>> still doesn't work. Benson investigated and found out it's just how
>> Hadoop works in this case.
>>
>> On Mon, May 9, 2011 at 12:06 AM, Ken Krugler <kk...@transpac.com>
>> wrote: > I haven't been ...
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Ken Krugler <kk...@transpac.com>.
On May 9, 2011, at 10:53am, Dmitriy Lyubimov wrote:

> Ok then I am missing some fundamental knowledge here about the 'job
> jar'. It's probably a lame question, but I'll try to ask it anyway.
> What is a "job jar"? Is it a java spec or a hadoop spec?

It's a Hadoop thing.

-- Ken

> 
> On Mon, May 9, 2011 at 10:51 AM, Jake Mannix <ja...@gmail.com> wrote:
>> On Mon, May 9, 2011 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>>> On Mon, May 9, 2011 at 8:36 AM, Benson Margulies <bi...@gmail.com>
>>> wrote:
>>>> 
>>>> It is activated by calling the 'setJar' API in the job conf, passing
>>>> the name of the jar that contains the lib folder.
>>> 
>>> Just for my education, are you saying you can pass in the entire lib
>>> folder to the setJar() thing? i wasn't aware of this, the javadoc for
>>> this method doesn't imply that.
>>> 
>> 
>> No, he's just saying pass in the name of the jar which happens to have
>> a lib directory in it (the "job" jar).
>> 
>>  -jake
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
And on Cloudera distro.

That's another thing: we think we release it for EC2 or 0.20.2, but in
fact almost nobody who has baremetal infrastructure is using 0.20.2,
so in fact we have to cater to many distros in reality.

E.g. I know all tests passed for CDH until beta4, but from beta4 on,
2 or 3 of them started breaking. Too tight integration somewhere.

-d

On Mon, May 9, 2011 at 1:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> not on current trunk though.
>
> On Mon, May 9, 2011 at 1:46 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> Yes.
>>
>> and also seq2sparse. This one seems to work as long as you don't try
>> to insert your custom analyzer with an option. (I did not try it though,
>> but I am planning to, and I remember people complained about it.)
>>
>>
>>
>> On Mon, May 9, 2011 at 1:39 PM, Jake Mannix <ja...@gmail.com> wrote:
>>> On Mon, May 9, 2011 at 1:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>>> I think we are certainly broken for backend use of Mahout (e.g. for
>>>> stuff like lucene analyzer strategies), but FWIW the last time I tried to
>>>> run the SSVD code it worked, and it does use math stuff and it also does
>>>> setJarByClass.
>>>>
>>>> Unfortunately, i can't run much else at the moment.
>>>>
>>>
>>> How did you run it?  Via "./bin/mahout ssvd <args>" ?
>>>
>>>   -jake
>>>
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
not on current trunk though.

On Mon, May 9, 2011 at 1:46 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Yes.
>
> and also seq2sparse. This one seems to work as long as you don't try
> to insert your custom analyzer with an option. (I did not try it though,
> but I am planning to, and I remember people complained about it.)
>
>
>
> On Mon, May 9, 2011 at 1:39 PM, Jake Mannix <ja...@gmail.com> wrote:
>> On Mon, May 9, 2011 at 1:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> I think we are certainly broken for backend use of Mahout (e.g. for
>>> stuff like lucene analyzer strategies), but FWIW the last time I tried to
>>> run the SSVD code it worked, and it does use math stuff and it also does
>>> setJarByClass.
>>>
>>> Unfortunately, i can't run much else at the moment.
>>>
>>
>> How did you run it?  Via "./bin/mahout ssvd <args>" ?
>>
>>   -jake
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yes.

And also seq2sparse. This one seems to work as long as you don't try
to insert your custom analyzer with an option. (I did not try it though,
but I am planning to, and I remember people complained about it.)



On Mon, May 9, 2011 at 1:39 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Mon, May 9, 2011 at 1:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> I think we are certainly broken for backend use of Mahout (e.g. for
>> stuff like lucene analyzer strategies), but FWIW the last time I tried to
>> run the SSVD code it worked, and it does use math stuff and it also does
>> setJarByClass.
>>
>> Unfortunately, i can't run much else at the moment.
>>
>
> How did you run it?  Via "./bin/mahout ssvd <args>" ?
>
>   -jake
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 9, 2011 at 1:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I think we are certainly broken for backend use of Mahout (e.g. for
> stuff like lucene analyzer strategies), but FWIW the last time I tried to
> run the SSVD code it worked, and it does use math stuff and it also does
> setJarByClass.
>
> Unfortunately, i can't run much else at the moment.
>

How did you run it?  Via "./bin/mahout ssvd <args>" ?

   -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I think we are certainly broken for backend use of Mahout (e.g. for
stuff like lucene analyzer strategies), but FWIW the last time I tried to
run the SSVD code it worked, and it does use math stuff and it also does
setJarByClass.

Unfortunately, i can't run much else at the moment.

On Mon, May 9, 2011 at 1:20 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Mon, May 9, 2011 at 1:09 PM, Benson Margulies <bi...@gmail.com>wrote:
>
>> Once more from the top.
>>
> There is a hadoop convention. It has nothing to do with the
>> MANIFEST.MF as I read the code.
>>
>
> Ah, sorry, that was something we do with these lib/-ified jars here at
> work (it's pretty common practice to do this, it's too bad it's not a
> java-supported spec).
>
>
>> I'm not an evangelist for the maven-shade-plugin, but my very
>> unscientific impression is that people walk up to mahout and expect
>> the mahout command to just 'work'. Unless someone can unveil a way to
>> script the exploitation of the distributed cache, that means that the
>> jar file that the mahout command hands to the hadoop command has to
>> use the 'lib/' convention, and have the correct structure of raw and
>> lib-ed classes.
>>
>
> Totally agree, if it works.
>
>
>> Further, any unsophisticated user who goes to incorporate Mahout into
>> a larger structure has to do likewise.
>>
>
> Well, users who want to incorporate mahout into a larger structure
> will have their own build system to interact with, and will need
> to be instructed to take our individual jars and package them
> up properly, no?
>
>
>> We could avoid exciting uses of the shade plugin altogether if we
>> didn't have these static methods that initialize jobs and call
>> setJarByClass on themselves. However, I don't see that for 0.5 unless
>> we want to push the schedule back and make a concerted effort.
>>
>> Further, I am concerned, based on Jake's remarks, that even following
>> the hadoop lib/ convention correctly doesn't always work, and we have
>> no diagnostic insight into the nature of the failure.
>>
>
> Can someone please try out our current code on another real cluster,
> so we have another data point?  My worry is that even without
> this setJarByClass business, we're not working properly.  If we are,
> I'm fine fixing this classpath stuff in 0.6
>
> If we're broken now, it needs fixing, asap.
>
>  -jake
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 9, 2011 at 1:09 PM, Benson Margulies <bi...@gmail.com>wrote:

> Once more from the top.
>
> There is a hadoop convention. It has nothing to do with the
> MANIFEST.MF as I read the code.
>

Ah, sorry, that was something we do with these lib/-ified jars here at
work (it's pretty common practice to do this, it's too bad it's not a
java-supported spec).


> I'm not an evangelist for the maven-shade-plugin, but my very
> unscientific impression is that people walk up to mahout and expect
> the mahout command to just 'work'. Unless someone can unveil a way to
> script the exploitation of the distributed cache, that means that the
> jar file that the mahout command hands to the hadoop command has to
> use the 'lib/' convention, and have the correct structure of raw and
> lib-ed classes.
>

Totally agree, if it works.


> Further, any unsophisticated user who goes to incorporate Mahout into
> a larger structure has to do likewise.
>

Well, users who want to incorporate mahout into a larger structure
will have their own build system to interact with, and will need
to be instructed to take our individual jars and package them
up properly, no?


> We could avoid exciting uses of the shade plugin altogether if we
> didn't have these static methods that initialize jobs and call
> setJarByClass on themselves. However, I don't see that for 0.5 unless
> we want to push the schedule back and make a concerted effort.
>
> Further, I am concerned, based on Jake's remarks, that even following
> the hadoop lib/ convention correctly doesn't always work, and we have
> no diagnostic insight into the nature of the failure.
>

Can someone please try out our current code on another real cluster,
so we have another data point?  My worry is that even without
this setJarByClass business, we're not working properly.  If we are,
I'm fine fixing this classpath stuff in 0.6

If we're broken now, it needs fixing, asap.

  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, May 9, 2011 at 1:44 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On Mon, May 9, 2011 at 1:39 PM, Jake Mannix <ja...@gmail.com> wrote:
>> On Mon, May 9, 2011 at 1:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>> does what you suggest in another way.  Also we could, ourselves,
>> copy these jars into the DistributedCache and do something that
>> way.
>>
>
> Yes, the latter: a 2-line loop over $MAHOUT_HOME/lib with one call to the
> distributed cache inside.
>
>

It seems that DistributedCache is deprecated in the 0.21 api though.
Apparently one has to use Job#addArchiveToClassPath instead. I should
try that in new-api jobs to see if this works seamlessly enough.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, May 9, 2011 at 1:39 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Mon, May 9, 2011 at 1:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> then AbstractJob implements walking the lib tree and adding those
>> paths (based on MAHOUT_HOME
>> or otherwise derived knowledge of lib location) and throws all the
>> jars there into backend path. all mahout projects
>> do something similar. Where's the complexity in that?
>>
>
> The complexity is right there: "throws all jars there into backend path".
>
> How do you wish to accomplish this?  Currently we follow the hadoop
> convention of doing this (lib/ inside of the jar passed to hadoop cli).
> It apparently doesn't always work (or never?  or is this PEBKAC?).
> We could alternately use the hadoop "-libjars" technique, which
> does what you suggest in another way.  Also we could, ourselves,
> copy these jars into the DistributedCache and do something that
> way.
>

Yes, the latter: a 2-line loop over $MAHOUT_HOME/lib with one call to the
distributed cache inside.
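
Something like this, perhaps (a hedged sketch: it assumes the lib jars
have already been copied to HDFS under some hdfsLibDir, the class and
method names are invented, and it uses the old 0.20 DistributedCache
API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class LibJarsSketch {
  // Push every jar found under an HDFS mirror of $MAHOUT_HOME/lib onto
  // the backend (task) classpath via the distributed cache.
  static void addLibJars(Configuration conf, String hdfsLibDir) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path(hdfsLibDir))) {
      DistributedCache.addFileToClassPath(status.getPath(), conf);
    }
  }
}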



>
> I really wish I knew why the lib/ thing doesn't work for vanilla
> calls to classes in our examples job-jar.
>

Yes. Exactly. That's the problem with 'conventions' vs. specs.
They are not specs, and so are subject to change on a whim.
And they are not an api, either, so you can't validate the contract
(beyond a certain degree, of course) at compile time.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Josh Patterson <jo...@cloudera.com>.
I've hit this (can't find the math lib) before, too. Would love to see it
be less "black magic" and more "just works". =)

On Mon, May 9, 2011 at 6:40 PM, Jake Mannix <ja...@gmail.com> wrote:
> wah.  Even trying to do seq2sparse doesn't work for me:
>
> [jake@smf1-ady-15-sr1 mahout-distribution-0.5-SNAPSHOT]$ ./bin/mahout
> seq2sparse -i hdfs://<namenode>/user/jake/text_temp -o
> hdfs://<namenode>/user/jake/text_vectors_temp
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/09 23:36:01 WARN driver.MahoutDriver: No seq2sparse.props found on
> classpath, will use command-line arguments only
> 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
> n-gram size is: 1
> 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum
> LLR value: 1.0
> 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of
> reduce tasks: 1
> 11/05/09 23:36:04 INFO input.FileInputFormat: Total input paths to process :
> 1
> 11/05/09 23:36:10 INFO mapred.JobClient: Running job:
> job_201104300433_126621
> 11/05/09 23:36:12 INFO mapred.JobClient:  map 0% reduce 0%
> 11/05/09 23:36:47 INFO mapred.JobClient: Task Id :
> attempt_201104300433_126621_m_000000_0, Status : FAILED
> 11/05/09 23:37:07 INFO mapred.JobClient: Task Id :
> attempt_201104300433_126621_m_000000_1, Status : FAILED
> Error: java.lang.ClassNotFoundException: org.apache.lucene.analysis.Analyzer
>
> ----
>
> Note I'm not specifying any fancy analyzer.  Just trying to run with the
> defaults. :\
>
>  -jake
>
> On Mon, May 9, 2011 at 2:21 PM, Jake Mannix <ja...@gmail.com> wrote:
>
>>
>> On Mon, May 9, 2011 at 2:15 PM, Sean Owen <sr...@gmail.com> wrote:
>>
>>> I think I am still +1 to just creating one re-packaged .jar -- for now
>>> at least. It fixes problems for sure.
>>> And then I am happy for the cognoscenti to construct a better solution
>>> later, and I'd be pleased to help.
>>> Though I still don't find this re-packaging a bad thing -- theoretical
>>> problems with signing keys and whatnot, yes, but they don't exist in
>>> practice now.
>>>
>>> I guess I'm asking whether anyone is for/against committing MAHOUT-691?
>>>
>>
>> I think for our examples job jar, this is a good idea, for now.
>>
>> I will try out your patch and see how it looks on my production cluster.
>>
>>   -jake
>>
>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Frank Scholten <fr...@frankscholten.nl>.
Just ran seq2sparse on a clean checkout of trunk with a cluster
started by Whirr. This works without problems.

frank@franktop:~/Desktop/mahout$ bin/mahout seq2sparse --input
target/posts --output target/seq2sparse --weight tfidf  --namedVector
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/home/frank/.whirr/frank-cluster/
11/05/11 17:57:17 WARN conf.Configuration: DEPRECATED: hadoop-site.xml
found in the classpath. Usage of hadoop-site.xml is deprecated.
Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to
override properties of core-default.xml, mapred-default.xml and
hdfs-default.xml respectively
11/05/11 17:57:18 INFO vectorizer.SparseVectorsFromSequenceFiles:
Maximum n-gram size is: 1
11/05/11 17:57:18 INFO vectorizer.SparseVectorsFromSequenceFiles:
Minimum LLR value: 1.0
11/05/11 17:57:18 INFO vectorizer.SparseVectorsFromSequenceFiles:
Number of reduce tasks: 1
11/05/11 17:57:19 INFO common.HadoopUtil: Deleting target/seq2sparse
11/05/11 17:58:42 INFO input.FileInputFormat: Total input paths to process : 1
11/05/11 17:58:45 INFO mapred.JobClient: Running job: job_201105111409_0009
11/05/11 17:58:46 INFO mapred.JobClient:  map 0% reduce 0%
11/05/11 17:59:00 INFO mapred.JobClient:  map 100% reduce 0%

Frank

On Tue, May 10, 2011 at 5:34 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Tue, May 10, 2011 at 8:24 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> I peeked in the examples job jar and it definitely does have this class,
>> along with the other dependencies (after my patch). Double-check that
>> you've
>> done the clean build and "install" again? And maybe even print out
>> MAHOUT_JOB
>> in the script to double-check what it is using?
>>
>
> [jake@smf1-ady-15-sr1 bla]$ jar -tf mahout-examples-0.5-SNAPSHOT-job.jar |
> grep "/Analyzer.class"
> org/apache/lucene/analysis/Analyzer.class
>
> [swap exec for echo in last line of bin/mahout ]
>
> [jake@smf1-ady-15-sr1 mahout-distribution-0.5-SNAPSHOT]$ ./bin/mahout
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> /usr/lib/hadoop-0.20/bin/hadoop jar
> /home/jake/mahout-distribution-0.5-SNAPSHOT/mahout-examples-0.5-SNAPSHOT-job.jar
> org.apache.mahout.driver.MahoutDriver
>
> :\
>
>
>> On Tue, May 10, 2011 at 12:40 AM, Jake Mannix <ja...@gmail.com>
>> wrote:
>>
>> > wah.  Even trying to do seq2sparse doesn't work for me:
>> >
>> > [jake@smf1-ady-15-sr1 mahout-distribution-0.5-SNAPSHOT]$ ./bin/mahout
>> > seq2sparse -i hdfs://<namenode>/user/jake/text_temp -o
>> > hdfs://<namenode>/user/jake/text_vectors_temp
>> > Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
>> > No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
>> > 11/05/09 23:36:01 WARN driver.MahoutDriver: No seq2sparse.props found on
>> > classpath, will use command-line arguments only
>> > 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
>> > n-gram size is: 1
>> > 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum
>> > LLR value: 1.0
>> > 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Number
>> of
>> > reduce tasks: 1
>> > 11/05/09 23:36:04 INFO input.FileInputFormat: Total input paths to
>> process
>> > :
>> > 1
>> > 11/05/09 23:36:10 INFO mapred.JobClient: Running job:
>> > job_201104300433_126621
>> > 11/05/09 23:36:12 INFO mapred.JobClient:  map 0% reduce 0%
>> > 11/05/09 23:36:47 INFO mapred.JobClient: Task Id :
>> > attempt_201104300433_126621_m_000000_0, Status : FAILED
>> > 11/05/09 23:37:07 INFO mapred.JobClient: Task Id :
>> > attempt_201104300433_126621_m_000000_1, Status : FAILED
>> > Error: java.lang.ClassNotFoundException:
>> > org.apache.lucene.analysis.Analyzer
>> >
>> > ----
>> >
>> > Note I'm not specifying any fancy analyzer.  Just trying to run with the
>> > defaults. :\
>> >
>> >  -jake
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, May 10, 2011 at 8:24 AM, Sean Owen <sr...@gmail.com> wrote:

> I peeked in the examples job jar and it definitely does have this class,
> along with the other dependencies (after my patch). Double-check that
> you've
> done the clean build and "install" again? And maybe even print out
> MAHOUT_JOB
> in the script to double-check what it is using?
>

[jake@smf1-ady-15-sr1 bla]$ jar -tf mahout-examples-0.5-SNAPSHOT-job.jar |
grep "/Analyzer.class"
org/apache/lucene/analysis/Analyzer.class

[swap exec for echo in last line of bin/mahout ]

[jake@smf1-ady-15-sr1 mahout-distribution-0.5-SNAPSHOT]$ ./bin/mahout
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
/usr/lib/hadoop-0.20/bin/hadoop jar
/home/jake/mahout-distribution-0.5-SNAPSHOT/mahout-examples-0.5-SNAPSHOT-job.jar
org.apache.mahout.driver.MahoutDriver

:\


> On Tue, May 10, 2011 at 12:40 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > wah.  Even trying to do seq2sparse doesn't work for me:
> >
> > [jake@smf1-ady-15-sr1 mahout-distribution-0.5-SNAPSHOT]$ ./bin/mahout
> > seq2sparse -i hdfs://<namenode>/user/jake/text_temp -o
> > hdfs://<namenode>/user/jake/text_vectors_temp
> > Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> > No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> > 11/05/09 23:36:01 WARN driver.MahoutDriver: No seq2sparse.props found on
> > classpath, will use command-line arguments only
> > 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
> > n-gram size is: 1
> > 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum
> > LLR value: 1.0
> > 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Number
> of
> > reduce tasks: 1
> > 11/05/09 23:36:04 INFO input.FileInputFormat: Total input paths to
> process
> > :
> > 1
> > 11/05/09 23:36:10 INFO mapred.JobClient: Running job:
> > job_201104300433_126621
> > 11/05/09 23:36:12 INFO mapred.JobClient:  map 0% reduce 0%
> > 11/05/09 23:36:47 INFO mapred.JobClient: Task Id :
> > attempt_201104300433_126621_m_000000_0, Status : FAILED
> > 11/05/09 23:37:07 INFO mapred.JobClient: Task Id :
> > attempt_201104300433_126621_m_000000_1, Status : FAILED
> > Error: java.lang.ClassNotFoundException:
> > org.apache.lucene.analysis.Analyzer
> >
> > ----
> >
> > Note I'm not specifying any fancy analyzer.  Just trying to run with the
> > defaults. :\
> >
> >  -jake
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
I peeked in the examples job jar and it definitely does have this class,
along with the other dependencies (after my patch). Double-check that you've
done the clean build and "install" again? And maybe even print out MAHOUT_JOB
in the script to double-check what it is using?

On Tue, May 10, 2011 at 12:40 AM, Jake Mannix <ja...@gmail.com> wrote:

> wah.  Even trying to do seq2sparse doesn't work for me:
>
> [jake@smf1-ady-15-sr1 mahout-distribution-0.5-SNAPSHOT]$ ./bin/mahout
> seq2sparse -i hdfs://<namenode>/user/jake/text_temp -o
> hdfs://<namenode>/user/jake/text_vectors_temp
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/09 23:36:01 WARN driver.MahoutDriver: No seq2sparse.props found on
> classpath, will use command-line arguments only
> 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
> n-gram size is: 1
> 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum
> LLR value: 1.0
> 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of
> reduce tasks: 1
> 11/05/09 23:36:04 INFO input.FileInputFormat: Total input paths to process
> :
> 1
> 11/05/09 23:36:10 INFO mapred.JobClient: Running job:
> job_201104300433_126621
> 11/05/09 23:36:12 INFO mapred.JobClient:  map 0% reduce 0%
> 11/05/09 23:36:47 INFO mapred.JobClient: Task Id :
> attempt_201104300433_126621_m_000000_0, Status : FAILED
> 11/05/09 23:37:07 INFO mapred.JobClient: Task Id :
> attempt_201104300433_126621_m_000000_1, Status : FAILED
> Error: java.lang.ClassNotFoundException:
> org.apache.lucene.analysis.Analyzer
>
> ----
>
> Note I'm not specifying any fancy analyzer.  Just trying to run with the
> defaults. :\
>
>  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
wah.  Even trying to do seq2sparse doesn't work for me:

[jake@smf1-ady-15-sr1 mahout-distribution-0.5-SNAPSHOT]$ ./bin/mahout
seq2sparse -i hdfs://<namenode>/user/jake/text_temp -o
hdfs://<namenode>/user/jake/text_vectors_temp
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
11/05/09 23:36:01 WARN driver.MahoutDriver: No seq2sparse.props found on
classpath, will use command-line arguments only
11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
n-gram size is: 1
11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum
LLR value: 1.0
11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of
reduce tasks: 1
11/05/09 23:36:04 INFO input.FileInputFormat: Total input paths to process :
1
11/05/09 23:36:10 INFO mapred.JobClient: Running job:
job_201104300433_126621
11/05/09 23:36:12 INFO mapred.JobClient:  map 0% reduce 0%
11/05/09 23:36:47 INFO mapred.JobClient: Task Id :
attempt_201104300433_126621_m_000000_0, Status : FAILED
11/05/09 23:37:07 INFO mapred.JobClient: Task Id :
attempt_201104300433_126621_m_000000_1, Status : FAILED
Error: java.lang.ClassNotFoundException: org.apache.lucene.analysis.Analyzer

----

Note I'm not specifying any fancy analyzer.  Just trying to run with the
defaults. :\

  -jake

On Mon, May 9, 2011 at 2:21 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> On Mon, May 9, 2011 at 2:15 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> I think I am still +1 to just creating one re-packaged .jar -- for now
>> at least. It fixes problems for sure.
>> And then I am happy for the cognoscenti to construct a better solution
>> later, and I'd be pleased to help.
>> Though I still don't find this re-packaging a bad thing -- theoretical
>> problems with signing keys and whatnot, yes, but they don't exist in
>> practice now.
>>
>> I guess I'm asking whether anyone is for/against committing MAHOUT-691?
>>
>
> I think for our examples job jar, this is a good idea, for now.
>
> I will try out your patch and see how it looks on my production cluster.
>
>   -jake
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 9, 2011 at 2:15 PM, Sean Owen <sr...@gmail.com> wrote:

> I think I am still +1 to just creating one re-packaged .jar -- for now
> at least. It fixes problems for sure.
> And then I am happy for the cognoscenti to construct a better solution
> later, and I'd be pleased to help.
> Though I still don't find this re-packaging a bad thing -- theoretical
> problems with signing keys and whatnot, yes, but don't exist in
> practice now.
>
> I guess I'm asking whether anyone is for/against committing MAHOUT-691?
>

I think for our examples job jar, this is a good idea, for now.

I will try out your patch and see how it looks on my production cluster.

  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
I think I am still +1 to just creating one re-packaged .jar -- for now
at least. It fixes problems for sure.
And then I am happy for the cognoscenti to construct a better solution
later, and I'd be pleased to help.
Though I still don't find this re-packaging a bad thing -- theoretical
problems with signing keys and whatnot exist, yes, but they don't arise in
practice now.

I guess I'm asking whether anyone is for/against committing MAHOUT-691?

On Mon, May 9, 2011 at 9:39 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Mon, May 9, 2011 at 1:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> then AbstractJob implements walking the lib tree and adding those
>> paths (based on MAHOUT_HOME
>> or otherwise derived knowledge of lib location) and throws all the
>> jars there into backend path. all mahout projects
>> do something similar. Where's the complexity in that?
>>
>
> The complexity is right there: "throws all jars there into backend path".
>
> How do you wish to accomplish this?  Currently we follow the hadoop
> convention of doing this (lib/ inside of the jar passed to hadoop cli).
> It apparently doesn't always work (or never?  or is this PEBKAC?).
> We could alternately use the hadoop "-libjars" technique, which
> does what you suggest in another way.  Also we could, ourselves,
> copy these jars into the DistributedCache and do something that
> way.
>
> But each of these has pros and cons, and figuring out which way
> works for the examples job, and our users, is the question we're
> getting at here.
>
> I really wish I knew why the lib/ thing doesn't work for vanilla
> calls to classes in our examples job-jar.
>
>  -jake
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 9, 2011 at 1:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> then AbstractJob implements walking the lib tree and adding those
> paths (based on MAHOUT_HOME
> or otherwise derived knowledge of lib location) and throws all the
> jars there into backend path. all mahout projects
> do something similar. Where's the complexity in that?
>

The complexity is right there: "throws all jars there into backend path".

How do you wish to accomplish this?  Currently we follow the hadoop
convention of doing this (lib/ inside of the jar passed to hadoop cli).
It apparently doesn't always work (or never?  or is this PEBKAC?).
We could alternately use the hadoop "-libjars" technique, which
does what you suggest in another way.  Also we could, ourselves,
copy these jars into the DistributedCache and do something that
way.
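
For the record, the "-libjars" route would look something like this on the
command line (jar and paths illustrative; -libjars is only recognized if
the driver runs through ToolRunner/GenericOptionsParser):

  hadoop jar mahout-examples-0.5-SNAPSHOT-job.jar <driver class> \
    -libjars /path/to/mahout-math-0.5-SNAPSHOT.jar <other args>

and the DistributedCache route is, roughly, one driver-side call per jar
(a sketch; it assumes the jar was already copied to HDFS, and that conf
is the job's Configuration):

  // make an HDFS-resident jar visible on the task classpath
  DistributedCache.addFileToClassPath(
      new Path("/user/jake/lib/mahout-math-0.5-SNAPSHOT.jar"), conf);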

But each of these has pros and cons, and figuring out which way
works for the examples job, and our users, is the question we're
getting at here.

I really wish I knew why the lib/ thing doesn't work for vanilla
calls to classes in our examples job-jar.

  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
>
> I'm not an evangelist for the maven-shade-plugin, but my very
> unscientific impression is that people walk up to mahout and expect
> the mahout command to just 'work'. Unless someone can unveil a way to
> script the exploitation of the distributed cache, that means that the
> jar file that the mahout command hands to the hadoop command has to
> use the 'lib/' convention, and have the correct structure of raw and
> lib-ed classes.

Here is what I think:

We require setting up MAHOUT_HOME. Well, most hadoop projects require
something of the sort.

then AbstractJob implements walking the lib tree and adding those
paths (based on MAHOUT_HOME
or otherwise derived knowledge of lib location) and throws all the
jars there into backend path. all mahout projects
do something similar. Where's the complexity in that?
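
In code, the walk I have in mind is nothing fancier than this (a sketch
only, using org.apache.hadoop.filecache.DistributedCache; paths are
hypothetical and error handling is omitted):

  // push each jar under $MAHOUT_HOME/lib to HDFS, then onto the task classpath
  FileSystem fs = FileSystem.get(conf);
  File libDir = new File(System.getenv("MAHOUT_HOME"), "lib");
  for (File jar : libDir.listFiles()) {
    if (jar.getName().endsWith(".jar")) {
      Path hdfsJar = new Path("/tmp/mahout/lib/" + jar.getName());
      fs.copyFromLocalFile(new Path(jar.getAbsolutePath()), hdfsJar);
      DistributedCache.addFileToClassPath(hdfsJar, conf);
    }
  }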

>
> Further, any unsophisticated user who goes to incorporate Mahout into
> a larger structure has to do likewise.

Yes.

There are two issues here:
1) client-side api use.

That should be fine as long as MAHOUT_HOME points to the right place.
Since the user is not involved in writing driver code, we are golden.

2) backend-side use of mahout? Not terribly expected, but maybe. E.g.
if mahout allows specifying external
strategies to do 'stuff', such as an external lucene analyzer in
seq2sparse, yes.

In this case, well, we need to figure out how to handle this ad hoc
through the command line.
Let's look at how other projects deal with the problem. Oh yes, they all
implement their own custom
mechanisms for these cases too!

Such as :

-- pig uses a custom register(jar) command
-- hive has an auxlib folder in HIVE_HOME where it expects to find user jars!

Something similar should be good for us as part of the ecosystem, should it not?

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
Once more from the top.

There is a hadoop convention. It has nothing to do with the
MANIFEST.MF, as I read the code.

In the hadoop convention, if someone calls setJar on the job conf, the
'lib/' folder of the indicated jar will be unpacked and the jars in it
added to the classpath on whatever nodes the job runs code on. If no
one calls setJar, then the only thing in the classpath is the jar
itself, unless you make other arrangements (as with the distributed
cache).

I'm not an evangelist for the maven-shade-plugin, but my very
unscientific impression is that people walk up to mahout and expect
the mahout command to just 'work'. Unless someone can unveil a way to
script the exploitation of the distributed cache, that means that the
jar file that the mahout command hands to the hadoop command has to
use the 'lib/' convention, and have the correct structure of raw and
lib-ed classes.

Further, any unsophisticated user who goes to incorporate Mahout into
a larger structure has to do likewise.

We could avoid exciting uses of the shade plugin altogether if we
didn't have these static methods that initialize jobs and call
setJarByClass on themselves. However, I don't see that for 0.5 unless
we want to push the schedule back and make a concerted effort.

Further, I am concerned, based on Jake's remarks, that even following
the hadoop lib/ convention correctly doesn't always work, and we have
no diagnostic insight into the nature of the failure.

So it seems at the instant as if our choices are to hold our noses and
shade, or give up on a trivial command line that runs our jobs without
a prerequisite of pushing the dependencies out into the cluster.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Jake, Ken.

Ok then, the "job" jar != shade plugin. It doesn't disassemble the
dependency jars, that is. (I guess I still need to find more data
about how it is used, but I get the idea.)

I am good with this approach. I am still slightly at odds with the fact
that it has to be done at release (build) time, but whatever. If
we just need to throw a pack of jars from the driver side onto the
backend classpath, DistributedCache is the tool for it (imo).



On Mon, May 9, 2011 at 11:13 AM, Jake Mannix <ja...@gmail.com> wrote:
> On Mon, May 9, 2011 at 10:53 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Ok then i am missing some fundamental knowledge here about the 'job
>> jar'. It's probably a lame question, but i'll try to ask it anyway.
>> What is a "job jar"? Is it a java spec or hadoop spec?
>>
>
> It's not a spec at all.  It's a hadoop convention.  The jar you pass in to
> the hadoop
> shell script ("hadoop jar mystuff.jar myclassname -myargs") has a MANIFEST
> which appends to the classpath of the mappers and reducers the contents of
> its own lib directory (inside the jar), where other jars reside.
>
> This is exactly analogous to the way servlet containers deal with .war files
> (except that WAR files became an actual spec).
>
> People in hadoop-land call the "jar with manifest pointing to its own
> internal
> lib directory" as a "job" jar.
>
>  -jake
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 9, 2011 at 10:53 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Ok then i am missing some fundamental knowledge here about the 'job
> jar'. It's probably a lame question, but i'll try to ask it anyway.
> What is a "job jar"? Is it a java spec or hadoop spec?
>

It's not a spec at all.  It's a hadoop convention.  The jar you pass in to
the hadoop
shell script ("hadoop jar mystuff.jar myclassname -myargs") has a MANIFEST
which appends to the classpath of the mappers and reducers the contents of
its own lib directory (inside the jar), where other jars reside.

This is exactly analogous to the way servlet containers deal with .war files
(except that WAR files became an actual spec).

People in hadoop-land call the "jar with manifest pointing to its own
internal
lib directory" as a "job" jar.
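
Concretely, the layout is something like this (contents illustrative):

  mahout-examples-0.5-SNAPSHOT-job.jar
    org/apache/mahout/...              <- unpacked job/driver classes
    lib/mahout-core-0.5-SNAPSHOT.jar   <- dependency jars, shipped whole
    lib/mahout-math-0.5-SNAPSHOT.jar
    lib/lucene-core-*.jar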

  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ok then, I am missing some fundamental knowledge here about the 'job
jar'. It's probably a lame question, but I'll try to ask it anyway.
What is a "job jar"? Is it a java spec or a hadoop spec?

On Mon, May 9, 2011 at 10:51 AM, Jake Mannix <ja...@gmail.com> wrote:
> On Mon, May 9, 2011 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> On Mon, May 9, 2011 at 8:36 AM, Benson Margulies <bi...@gmail.com>
>> wrote:
>> >
>> > It is activated by calling the 'setJar' API in the job conf, passing
>> > the name of the jar that contains the lib folder.
>>
>> Just for my education, are you saying you can pass in the entire lib
>> folder to the setJar() thing? i wasn't aware of this, the javadoc for
>> this method doesn't imply that.
>>
>
> No, he's just saying pass in the name of the jar which happens to have
> a lib directory in it (the "job" jar).
>
>  -jake
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 9, 2011 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Mon, May 9, 2011 at 8:36 AM, Benson Margulies <bi...@gmail.com>
> wrote:
> >
> > It is activated by calling the 'setJar' API in the job conf, passing
> > the name of the jar that contains the lib folder.
>
> Just for my education, are you saying you can pass in the entire lib
> folder to the setJar() thing? i wasn't aware of this, the javadoc for
> this method doesn't imply that.
>

No, he's just saying pass in the name of the jar which happens to have
a lib directory in it (the "job" jar).

  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, May 9, 2011 at 8:36 AM, Benson Margulies <bi...@gmail.com> wrote:
>
> It is activated by calling the 'setJar' API in the job conf, passing
> the name of the jar that contains the lib folder.

Just for my education, are you saying you can pass in the entire lib
folder to the setJar() thing? I wasn't aware of this; the javadoc for
this method doesn't imply that.

But if that's the case, it should be technically equivalent to
iterating through the lib folder and adding the jars manually through
DistributedCache. There's no difference from the user's point of view;
they never get to modify the driver code, we do.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Ted Dunning <te...@gmail.com>.
It is called jar-with-dependencies (the maven-assembly-plugin's built-in
descriptor, which unpacks every dependency into one flat jar)

On Mon, May 9, 2011 at 10:37 AM, Paul Mahon <pm...@decarta.com> wrote:

> I use the jarjar task with ant, I expect Maven has something similar

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ok, for what it's worth, -1 on the shade plugin.

The lack of problematic dependencies at present doesn't make it a good practice IMO.

It costs just the same to manage the backend classpath "correctly" (i.e.
in a way compatible with all future dependencies) as to do it an
"incompatible" way, if we implement it now.

It may not be so easy in the future if we lock ourselves into doing it
in a particular way.
I don't see any drawback (including "user support emails") in
Mahout's driver code (aka AbstractJob) doing it on its own vs.
handling it with shade. But there's something to gain, such as
compliance with the java jar spec.


On Mon, May 9, 2011 at 10:42 AM, Benson Margulies <bi...@gmail.com> wrote:
> Paul,
>
> The usual maven tool for the purpose is the shade plugin. As
> previously noted, there are some possible pitfalls, but there's
> evidence (including yours) that they are not relevant to Mahout at the
> moment.
>
> --benson
>
>
> On Mon, May 9, 2011 at 1:37 PM, Paul Mahon <pm...@decarta.com> wrote:
>> I use the one big jar technique for regular hadoop and mahout jobs because
>> of these kinds of problems. I use the jarjar task with ant, I expect Maven
>> has something similar. I haven't had any of the class not found problems
>> since I started doing it.
>>
>> On 05/09/2011 10:32 AM, Benson Margulies wrote:
>>>>
>>>> So that explains how some user rebundlings don't work with us, sometimes.
>>>> What it doesn't explain is why running the regular, not-rebundled
>>>> "mahout-examples-0.5-SNAPSHOT-job.jar" via the bin/mahout shell
>>>> script is throwing this ClassNotFoundException for me (and it's happened
>>>> to Sean, and according to the list archives, others as well) in a
>>>> production
>>>> cluster.
>>>
>>> I agree that it doesn't explain. However, the code in hadoop that
>>> implements this mechanism, well, if you ask me ... it STINKS. It
>>> wouldn't surprise me if it fails in some case we haven't
>>> characterized. This would argue for Sean's 'one big jar' approach.
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
Paul,

The usual maven tool for the purpose is the shade plugin. As
previously noted, there are some possible pitfalls, but there's
evidence (including yours) that they are not relevant to Mahout at the
moment.

--benson


On Mon, May 9, 2011 at 1:37 PM, Paul Mahon <pm...@decarta.com> wrote:
> I use the one big jar technique for regular hadoop and mahout jobs because
> of these kinds of problems. I use the jarjar task with ant, I expect Maven
> has something similar. I haven't had any of the class not found problems
> since I started doing it.
>
> On 05/09/2011 10:32 AM, Benson Margulies wrote:
>>>
>>> So that explains how some user rebundlings don't work with us, sometimes.
>>> What it doesn't explain is why running the regular, not-rebundled
>>> "mahout-examples-0.5-SNAPSHOT-job.jar" via the bin/mahout shell
>>> script is throwing this ClassNotFoundException for me (and it's happened
>>> to Sean, and according to the list archives, others as well) in a
>>> production
>>> cluster.
>>
>> I agree that it doesn't explain. However, the code in hadoop that
>> implements this mechanism, well, if you ask me ... it STINKS. It
>> wouldn't surprise me if it fails in some case we haven't
>> characterized. This would argue for Sean's 'one big jar' approach.
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Paul Mahon <pm...@decarta.com>.
I use the one big jar technique for regular hadoop and mahout jobs 
because of these kinds of problems. I use the jarjar task with ant, I 
expect Maven has something similar. I haven't had any of the class not 
found problems since I started doing it.

On 05/09/2011 10:32 AM, Benson Margulies wrote:
>> So that explains how some user rebundlings don't work with us, sometimes.
>> What it doesn't explain is why running the regular, not-rebundled
>> "mahout-examples-0.5-SNAPSHOT-job.jar" via the bin/mahout shell
>> script is throwing this ClassNotFoundException for me (and it's happened
>> to Sean, and according to the list archives, others as well) in a production
>> cluster.
> I agree that it doesn't explain. However, the code in hadoop that
> implements this mechanism, well, if you ask me ... it STINKS. It
> wouldn't surprise me if it fails in some case we haven't
> characterized. This would argue for Sean's 'one big jar' approach.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
>
> So that explains how some user rebundlings don't work with us, sometimes.
> What it doesn't explain is why running the regular, not-rebundled
> "mahout-examples-0.5-SNAPSHOT-job.jar" via the bin/mahout shell
> script is throwing this ClassNotFoundException for me (and it's happened
> to Sean, and according to the list archives, others as well) in a production
> cluster.

I agree that it doesn't explain. However, the code in hadoop that
implements this mechanism, well, if you ask me ... it STINKS. It
wouldn't surprise me if it fails in some case we haven't
characterized. This would argue for Sean's 'one big jar' approach.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 9, 2011 at 8:36 AM, Benson Margulies <bi...@gmail.com>wrote:
>
> As a convenience (and a trap for the unwary), there is a convenience:
> setJarByClass. This takes a Class<?> instead of a string jar path. It
> attempts to derive a jar name from the class reference.
>

Ok, this last bit is the piece of the puzzle I was missing / forgetting.


> Mahout then has a series of self-contained classes that create JobConf
> objects, and make calls to setJarByClass, passing Whatever.class. If
> one of those classes somehow wanders into lib/ (like, a person
> building a job jar puts mahout into 'lib/' and then tries to use a
> Mahout job class) the call to setJarByClass is at best ineffective and
> at worst destructive.
>

So that explains how some user rebundlings don't work with us, sometimes.
What it doesn't explain is why running the regular, not-rebundled
"mahout-examples-0.5-SNAPSHOT-job.jar" via the bin/mahout shell
script is throwing this ClassNotFoundException for me (and it's happened
to Sean, and according to the list archives, others as well) in a production
cluster.

  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
(I mean the latter. One .jar with no .jar files inside.)
I think we should let this simmer a bit to see if anyone has more
bright ideas. While we have a working solution (good) I dislike
committing something that you find ugly.

On Mon, May 9, 2011 at 5:08 PM, Benson Margulies <bi...@gmail.com> wrote:
> The term 'make one jar' is a bit ambiguous to me. If you mean, 'make
> one job jar that has all the job classes in the unpacked parts and all
> the dependencies in lib', fine. If you mean 'shade it all into one
> giant jar without a lib', I'm slightly squeamish -- but there are
> options in shade to assure that this works right, even given bouncy
> castle or other things with SPIs if we have any to worry about.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
The term 'make one jar' is a bit ambiguous to me. If you mean, 'make
one job jar that has all the job classes in the unpacked parts and all
the dependencies in lib', fine. If you mean 'shade it all into one
giant jar without a lib', I'm slightly squeamish -- but there are
options in shade to assure that this works right, even given bouncy
castle or other things with SPIs if we have any to worry about.

On Mon, May 9, 2011 at 11:41 AM, Sean Owen <sr...@gmail.com> wrote:
> Ah, OK. The trickiness there is that we don't know the location of the
> jar. (Right?) The user can tell us, though they're then specifying it
> twice on the command line, once to Hadoop and once to us. At least I
> don't know of something smarter.
>
> Is there any better interim solution than just packaging it all up
> into one .jar, which obviates this issue? (That doesn't personally
> offend my sense of hackiness and propriety anyway, but I do see the
> arguments there.) Because it looks like we need to do *something* for
> now.
>
> And then I bet there's a better long term answer even as I don't know
> what it is. Heck, if someone does know and it's not too hard, I'll
> make it happen now.
>
> Sean
>
> On Mon, May 9, 2011 at 4:36 PM, Benson Margulies <bi...@gmail.com> wrote:
>> The 'lib/' convention is not a feature of Java, it's a feature of hadoop.
>>
>> It is activated by calling the 'setJar' API in the job conf, passing
>> the name of the jar that contains the lib folder.
>>
>> As a convenience (and a trap for the unwary), there is a convenience:
>> setJarByClass. This takes a Class<?> instead of a string jar path. It
>> attempts to derive a jar name from the class reference.
>>
>> Mahout then has a series of self-contained classes that create JobConf
>> objects, and make calls to setJarByClass, passing Whatever.class. If
>> one of those classes somehow wanders into lib/ (like, a person
>> building a job jar puts mahout into 'lib/' and then tries to use a
>> Mahout job class) the call to setJarByClass is at best ineffective and
>> at worst destructive.
>>
>> On Mon, May 9, 2011 at 11:07 AM, Jake Mannix <ja...@gmail.com> wrote:
>>> Benson,
>>>
>>>  Can you remind me what the "setJarByClass" issue is again?
>>>
>>> On May 9, 2011 6:30 AM, "Benson Margulies" <bi...@gmail.com> wrote:
>>>
>>> I see no reason to stop using the 'lib/' convention in our jobs.
>>>
>>> There are apparently plenty of people out there who don't know
>>> anything about the distributed cache. If we require its use to run
>>> simple jobs, we're going to be up to our ears in support email.
>>>
>>> I favor the following strategy:
>>>
>>> 1) Make sure that the split between 'lib/' and unpacked classes in
>>> our job jars is *correct* so that all the operations of the mahout
>>> command work out of the box.
>>>
>>> 2) post 0.5, act on the proposed refactoring so that none of our code
>>> is calling setJarByClass in a way that forces users to do complex
>>> re-shading for themselves. That's the 'bean' proposal, in which each
>>> of our jobs is a bean, and a user who wants to combine ours and theirs
>>> can make their own call to setJar/setJarByClass appropriately.
>>>
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
Ah, OK. The trickiness there is that we don't know the location of the
jar. (Right?) The user can tell us, though they're then specifying it
twice on the command line, once to Hadoop and once to us. At least I
don't know of something smarter.

Is there any better interim solution than just packaging it all up
into one .jar, which obviates this issue? (That doesn't personally
offend my sense of hackiness and propriety anyway, but I do see the
arguments there.) Because it looks like we need to do *something* for
now.

And then I bet there's a better long term answer even as I don't know
what it is. Heck, if someone does know and it's not too hard, I'll
make it happen now.

Sean

On Mon, May 9, 2011 at 4:36 PM, Benson Margulies <bi...@gmail.com> wrote:
> The 'lib/' convention is not a feature of Java, it's a feature of hadoop.
>
> It is activated by calling the 'setJar' API in the job conf, passing
> the name of the jar that contains the lib folder.
>
> As a convenience (and a trap for the unwary), there is a convenience:
> setJarByClass. This takes a Class<?> instead of a string jar path. It
> attempts to derive a jar name from the class reference.
>
> Mahout then has a series of self-contained classes that create JobConf
> objects, and make calls to setJarByClass, passing Whatever.class. If
> one of those classes somehow wanders into lib/ (like, a person
> building a job jar puts mahout into 'lib/' and then tries to use a
> Mahout job class) the call to setJarByClass is at best ineffective and
> at worst destructive.
>
> On Mon, May 9, 2011 at 11:07 AM, Jake Mannix <ja...@gmail.com> wrote:
>> Benson,
>>
>>  Can you remind me what the "setJarByClass" issue is again?
>>
>> On May 9, 2011 6:30 AM, "Benson Margulies" <bi...@gmail.com> wrote:
>>
>> I see no reason to stop using the 'lib/' convention in our jobs.
>>
>> There are apparently plenty of people out there who don't know
>> anything about the distributed cache. If we require its use to run
>> simple jobs, we're going to be up to our ears in support email.
>>
>> I favor the following strategy:
>>
>> 1) Make sure that the split between 'lib/' and unpacked classes in
>> our job jars is *correct* so that all the operations of the mahout
>> command work out of the box.
>>
>> 2) post 0.5, act on the proposed refactoring so that none of our code
>> is calling setJarByClass in a way that forces users to do complex
>> re-shading for themselves. That's the 'bean' proposal, in which each
>> of our jobs is a bean, and a user who wants to combine ours and theirs
>> can make their own call to setJar/setJarByClass appropriately.
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
The 'lib/' convention is not a feature of Java, it's a feature of hadoop.

It is activated by calling the 'setJar' API in the job conf, passing
the name of the jar that contains the lib folder.

As a convenience (and a trap for the unwary), there is a shortcut:
setJarByClass. This takes a Class<?> instead of a string jar path. It
attempts to derive a jar name from the class reference.

Mahout then has a series of self-contained classes that create JobConf
objects, and make calls to setJarByClass, passing Whatever.class. If
one of those classes somehow wanders into lib/ (like, a person
building a job jar puts mahout into 'lib/' and then tries to use a
Mahout job class) the call to setJarByClass is at best ineffective and
at worst destructive.
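
In code, the two forms look like this (a sketch; MyMapper is a hypothetical
class, not anything in Mahout):

  JobConf conf = new JobConf();

  // explicit and safe: name the job jar directly
  conf.setJar("mahout-examples-0.5-SNAPSHOT-job.jar");

  // convenient but fragile: hadoop tries to locate the jar that
  // MyMapper.class was loaded from; if that class lives in a jar
  // nested under lib/, the lookup is at best ineffective
  conf.setJarByClass(MyMapper.class);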

On Mon, May 9, 2011 at 11:07 AM, Jake Mannix <ja...@gmail.com> wrote:
> Benson,
>
>  Can you remind me what the "setJarByClass" issue is again?
>
> On May 9, 2011 6:30 AM, "Benson Margulies" <bi...@gmail.com> wrote:
>
> I see no reason to stop using the 'lib/' convention in our jobs.
>
> There are apparently plenty of people out there who don't know
> anything about the distributed cache. If we require its use to run
> simple jobs, we're going to be up to our ears in support email.
>
> I favor the following strategy:
>
> 1) Make sure that the split between 'lib/' and unpacked classes in
> our job jars is *correct* so that all the operations of the mahout
> command work out of the box.
>
> 2) post 0.5, act on the proposed refactoring so that none of our code
> is calling setJarByClass in a way that forces users to do complex
> re-shading for themselves. That's the 'bean' proposal, in which each
> of our jobs is a bean, and a user who wants to combine ours and theirs
> can make their own call to setJar/setJarByClass appropriately.
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
Benson,

  Can you remind me what the "setJarByClass" issue is again?

On May 9, 2011 6:30 AM, "Benson Margulies" <bi...@gmail.com> wrote:

I see no reason to stop using the 'lib/' convention in our jobs.

There are apparently plenty of people out there who don't know
anything about the distributed cache. If we require its use to run
simple jobs, we're going to be up to our ears in support email.

I favor the following strategy:

1) Make sure that the split between 'lib/' and unpacked classes in
our job jars is *correct* so that all the operations of the mahout
command work out of the box.

2) post 0.5, act on the proposed refactoring so that none of our code
is calling setJarByClass in a way that forces users to do complex
re-shading for themselves. That's the 'bean' proposal, in which each
of our jobs is a bean, and a user who wants to combine ours and theirs
can make their own call to setJar/setJarByClass appropriately.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 9, 2011 at 8:18 AM, Sean Owen <sr...@gmail.com> wrote:

> I am not wedded to changing or not changing a convention -- both
> strike me as equally valid. The question is what works of course.
>
> My understanding is that the dependency is definitely in lib/, but
> still isn't working. That was my experience too. So I am not sure if
> it's a question of having the right thing in there?
>
> I'm maybe being dense but what is the way forward then? I have one
> answer on the table FWIW.
>

Your solution is the only one I *know* works on typical production clusters
(in 99% of cases, including ours).

I'm going to ask my local resident hadoop experts to see if they've
got a less hacky, yet still functional, solution.

  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
I am not wedded to changing or not changing a convention -- both
strike me as equally valid. The question is what works of course.

My understanding is that the dependency is definitely in lib/, but
still isn't working. That was my experience too. So I am not sure if
it's a question of having the right thing in there?

I'm maybe being dense but what is the way forward then? I have one
answer on the table FWIW.

On Mon, May 9, 2011 at 2:30 PM, Benson Margulies <bi...@gmail.com> wrote:
> I see no reason to stop using the 'lib/' convention in our jobs.
>
> There are apparently plenty of people out there who don't know
> anything about the distributed cache. If we require its use to run
> simple jobs, we're going to be up to our ears in support email.
>
> I favor the following strategy:
>
> 1) Make sure that the split between 'lib/' and unpacked classes in
> our job jars is *correct* so that all the operations of the mahout
> command work out of the box.
>
> 2) post 0.5, act on the proposed refactoring so that none of our code
> is calling setJarByClass in a way that forces users to do complex
> re-shading for themselves. That's the 'bean' proposal, in which each
> of our jobs is a bean, and a user who wants to combine ours and theirs
> can make their own call to setJar/setJarByClass appropriately.
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
I see no reason to stop using the 'lib/' convention in our jobs.

There are apparently plenty of people out there who don't know
anything about the distributed cache. If we require its use to run
simple jobs, we're going to be up to our ears in support email.

I favor the following strategy:

1) Make sure that the split between 'lib/' and unpacked classes in
our job jars is *correct* so that all the operations of the mahout
command work out of the box.

2) post 0.5, act on the proposed refactoring so that none of our code
is calling setJarByClass in a way that forces users to do complex
re-shading for themselves. That's the 'bean' proposal, in which each
of our jobs is a bean, and a user who wants to combine ours and theirs
can make their own call to setJar/setJarByClass appropriately.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Sun, May 8, 2011 at 10:49 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Sun, May 8, 2011 at 9:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> I never actually ran into this error. I guess my backend code never
>> called anything outside the jar.
>>

I do... but I still never ran into it... not sure how it happened. I
just use the regular mahout AbstractJob and configured the mahout
<command> by analogy with the others, and it just worked.

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, May 8, 2011 at 9:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I never actually ran into this error. I guess my backend code never
> called anything outside the jar.
>

You never use elements of mahout-math in your mappers or reducers?
o.a.m.math.Vector, for instance?


> A better and more hadoop-like technique is to rework the standard driver class
> so that it tosses everything the assembly placed into lib onto the backend
> classpath explicitly, using DistributedCache. Warning: this
> functionality is somewhat broken in standard 0.20.2 and requires
> a hack to work.
>

But I know I've used the whole "jars inside of the lib/ directory" trick
on Hadoop clusters.  I've done it even with Mahout to demonstrate
it on EC2, for example.

Did something change in the way we package our job jar since we
last had a release?

  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I never actually ran into this error. I guess my backend code never
called anything outside the jar.

But I do, or rather did, have similar problems with my project. I
think I already voiced my opinion on this last time.

One solution is to use the shade plugin or a similar technique to create one
job jar with all dependencies in it. Which I deem a bad practice,
because it unjars the existing dependency jars, and that breaks things on
occasion (e.g. if one of the dependencies is a signed jar, such as
BouncyCastle). Projects get into using the shade plugin only to require a
major retrofit when they hit a dependency like this.

A better and more hadoop-like technique is to rework the standard driver class
so that it tosses everything the assembly placed into lib onto the backend
classpath explicitly, using DistributedCache. Warning: this
functionality is somewhat broken in standard 0.20.2 and requires
a hack to work.

-d

On Sun, May 8, 2011 at 5:09 PM, Jake Mannix <ja...@gmail.com> wrote:
> I haven't run a post 0.4 mahout build on a production hadoop cluster before
> this, and I'm baffled that we have a job jar which simply -does not work-.
> Is this really not me, and our stock examples jobs are broken on hadoop?
>
>  -jake
>
> On May 8, 2011 4:14 PM, "Sean Owen" <sr...@gmail.com> wrote:
>
> (The build error indicates you have some old class files somewhere --
> "clean" first)
>
> Here, the lib/ directory definitely has the right dependencies and it
> still doesn't work. Benson investigated and found out it's just how
> Hadoop works in this case.
>
> On Mon, May 9, 2011 at 12:06 AM, Ken Krugler <kk...@transpac.com>
> wrote: > I haven't been ...
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
I haven't run a post 0.4 mahout build on a production hadoop cluster before
this, and I'm baffled that we have a job jar which simply -does not work-.
Is this really not me, and our stock examples jobs are broken on hadoop?

  -jake

On May 8, 2011 4:14 PM, "Sean Owen" <sr...@gmail.com> wrote:

(The build error indicates you have some old class files somewhere --
"clean" first)

Here, the lib/ directory definitely has the right dependencies and it
still doesn't work. Benson investigated and found out it's just how
Hadoop works in this case.

On Mon, May 9, 2011 at 12:06 AM, Ken Krugler <kk...@transpac.com>
wrote: > I haven't been ...

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
(The build error indicates you have some old class files somewhere --
"clean" first)

Here, the lib/ directory definitely has the right dependencies and it
still doesn't work. Benson investigated and found out it's just how
Hadoop works in this case.

On Mon, May 9, 2011 at 12:06 AM, Ken Krugler
<kk...@transpac.com> wrote:
> I haven't been actively running Mahout for a while, but I do watch plenty of Hadoop students run into the ClassNotFoundException problem.
>
> A standard Hadoop job jar has a lib subdir, which contains (as jars) all of the dependencies.
>
> Typically the missing class problem is caused by somebody building their own Hadoop job jar, where they don't include a dependent jar (such as mahout-math) in the lib subdir.
>
> Or somebody is trying to run a job locally, using the job jar directly, which then has to be unpacked as otherwise these embedded lib/*.jar classes aren't on the classpath.
>
> But neither of those seem to match what Jake was doing:
>
>> (just running things like "./bin/mahout svd -i <input> -o <output> etc... ")
>
>
> I was going to try this out from trunk, but an svn up on trunk and then "mvn install" failed to pass one of the tests:
>
>> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.025 sec <<< FAILURE!
>> fullRankTall(org.apache.mahout.math.QRDecompositionTest)  Time elapsed: 0.014 sec  <<< ERROR!
>> java.lang.NoSuchFieldError: MAX
>>         at org.apache.mahout.math.QRDecompositionTest.assertEquals(QRDecompositionTest.java:122)
>>         at org.apache.mahout.math.QRDecompositionTest.fullRankTall(QRDecompositionTest.java:38)
>
>
> -- Ken

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Ken Krugler <kk...@transpac.com>.
I haven't been actively running Mahout for a while, but I do watch plenty of Hadoop students run into the ClassNotFoundException problem.

A standard Hadoop job jar has a lib subdir, which contains (as jars) all of the dependencies.

Typically the missing class problem is caused by somebody building their own Hadoop job jar, where they don't include a dependent jar (such as mahout-math) in the lib subdir.

Or somebody is trying to run a job locally, using the job jar directly, which then has to be unpacked, since otherwise the classes in the embedded lib/*.jar files aren't on the classpath.
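
A quick sanity check for either case is to list the job jar and confirm the dependency is where it should be (jar name illustrative):

  jar tf mahout-examples-0.5-SNAPSHOT-job.jar | grep '^lib/'

which should print mahout-math and the rest of the dependency jars.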

But neither of those seem to match what Jake was doing:

> (just running things like "./bin/mahout svd -i <input> -o <output> etc... ")


I was going to try this out from trunk, but an svn up on trunk and then "mvn install" failed to pass one of the tests:

> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.025 sec <<< FAILURE!
> fullRankTall(org.apache.mahout.math.QRDecompositionTest)  Time elapsed: 0.014 sec  <<< ERROR!
> java.lang.NoSuchFieldError: MAX
>         at org.apache.mahout.math.QRDecompositionTest.assertEquals(QRDecompositionTest.java:122)
>         at org.apache.mahout.math.QRDecompositionTest.fullRankTall(QRDecompositionTest.java:38)


-- Ken

On May 8, 2011, at 3:44pm, Sean Owen wrote:

> It definitely works for me to package into one jar. Is this merely
> "icky" or does it not work for another reason?
> Yes I'm not suggesting we make users tweak the Maven build, but that
> we make this tweak ourselves. It's just removing the overriding of
> "unpack" behavior in job.xml files that I mean.
> 
> On Sun, May 8, 2011 at 11:36 PM, Benson Margulies <bi...@gmail.com> wrote:
>> There isn't a good solution for 0.5.
>> 
>> The code that calls setJarByClass has to pass a class that is NOT in
>> the lib directory, but rather in the unpacked classes. It's really
>> easy to build a hadoop job with Mahout that violates that rule due to
>> all the static methods that create jobs.
>> 
>> We seem to have a consensus to rework all the jobs as beans so that
>> this can be wrestled into control.
>> 
>> 
>> 
>> On Sun, May 8, 2011 at 6:16 PM, Jake Mannix <ja...@gmail.com> wrote:
>>> On Sun, May 8, 2011 at 2:58 PM, Sean Owen <sr...@gmail.com> wrote:
>>> 
>>>> If I recall the last discussion on this correctly --
>>>> 
>>>> No you don't want to put anything in Hadoop's lib/ directory. Even if
>>>> you can, that's not the "right" way.
>>>> You want to use the job file indeed, which should contain all dependencies.
>>>> However, it packages dependencies as jars-in-the-jar, which doesn't
>>>> work for Hadoop.
>>>> 
>>> 
>>> I thought that hadoop was totally fine with jars inside of the jar, if
>>> they're
>>> in the lib directory?
>>> 
>>> 
>>>> I think if you modify the Maven build to just repackage all classes
>>>> into the main jar, it works. It works for me at least.
>>>> 
>>> 
>>> Clearly we're not expecting people to do this.  I wasn't even running with
>>> special new classes, it wasn't finding *Vector* - if this doesn't work on
>>> a real cluster, then most of our entire codebase (which requires
>>> mahout-math) doesn't work.
>>> 
>>>  -jake
>>> 
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
It definitely works for me to package into one jar. Is this merely
"icky" or does it not work for another reason?
Yes I'm not suggesting we make users tweak the Maven build, but that
we make this tweak ourselves. It's just removing the overriding of
"unpack" behavior in job.xml files that I mean.

On Sun, May 8, 2011 at 11:36 PM, Benson Margulies <bi...@gmail.com> wrote:
> There isn't a good solution for 0.5.
>
> The code that calls setJarByClass has to pass a class that is NOT in
> the lib directory, but rather in the unpacked classes. It's really
> easy to build a hadoop job with Mahout that violates that rule due to
> all the static methods that create jobs.
>
> We seem to have a consensus to rework all the jobs as beans so that
> this can be wrestled into control.
>
>
>
> On Sun, May 8, 2011 at 6:16 PM, Jake Mannix <ja...@gmail.com> wrote:
>> On Sun, May 8, 2011 at 2:58 PM, Sean Owen <sr...@gmail.com> wrote:
>>
>>> If I recall the last discussion on this correctly --
>>>
>>> No you don't want to put anything in Hadoop's lib/ directory. Even if
>>> you can, that's not the "right" way.
>>> You want to use the job file indeed, which should contain all dependencies.
>>> However, it packages dependencies as jars-in-the-jar, which doesn't
>>> work for Hadoop.
>>>
>>
>> I thought that hadoop was totally fine with jars inside of the jar, if
>> they're
>> in the lib directory?
>>
>>
>>> I think if you modify the Maven build to just repackage all classes
>>> into the main jar, it works. It works for me at least.
>>>
>>
>> Clearly we're not expecting people to do this.  I wasn't even running with
>> special new classes, it wasn't finding *Vector* - if this doesn't work on
>> a real cluster, then most of our entire codebase (which requires
>> mahout-math) doesn't work.
>>
>>  -jake
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
There isn't a good solution for 0.5.

The code that calls setJarByClass has to pass a class that is NOT in
the lib directory, but rather in the unpacked classes. It's really
easy to build a hadoop job with Mahout that violates that rule due to
all the static methods that create jobs.

We seem to have a consensus to rework all the jobs as beans so that
this can be wrestled into control.



On Sun, May 8, 2011 at 6:16 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Sun, May 8, 2011 at 2:58 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> If I recall the last discussion on this correctly --
>>
>> No you don't want to put anything in Hadoop's lib/ directory. Even if
>> you can, that's not the "right" way.
>> You want to use the job file indeed, which should contain all dependencies.
>> However, it packages dependencies as jars-in-the-jar, which doesn't
>> work for Hadoop.
>>
>
> I thought that hadoop was totally fine with jars inside of the jar, if
> they're
> in the lib directory?
>
>
>> I think if you modify the Maven build to just repackage all classes
>> into the main jar, it works. It works for me at least.
>>
>
> Clearly we're not expecting people to do this.  I wasn't even running with
> special new classes, it wasn't finding *Vector* - if this doesn't work on
> a real cluster, then most of our entire codebase (which requires
> mahout-math) doesn't work.
>
>  -jake
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, May 8, 2011 at 2:58 PM, Sean Owen <sr...@gmail.com> wrote:

> If I recall the last discussion on this correctly --
>
> No you don't want to put anything in Hadoop's lib/ directory. Even if
> you can, that's not the "right" way.
> You want to use the job file indeed, which should contain all dependencies.
> However, it packages dependencies as jars-in-the-jar, which doesn't
> work for Hadoop.
>

I thought that hadoop was totally fine with jars inside of the jar, if
they're
in the lib directory?


> I think if you modify the Maven build to just repackage all classes
> into the main jar, it works. It works for me at least.
>

Clearly we're not expecting people to do this.  I wasn't even running with
special new classes, it wasn't finding *Vector* - if this doesn't work on
a real cluster, then most of our entire codebase (which requires
mahout-math) doesn't work.

  -jake

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
And what was the solution to the "setJarByClass" discussion, do you
remember?

  -jake

On Sun, May 8, 2011 at 3:12 PM, Benson Margulies <bi...@gmail.com>wrote:

> I believe that the bottom of this is the problems with setJarByClass
> that were documented some weeks ago.
>
> On Sun, May 8, 2011 at 5:58 PM, Sean Owen <sr...@gmail.com> wrote:
> > If I recall the last discussion on this correctly --
> >
> > No you don't want to put anything in Hadoop's lib/ directory. Even if
> > you can, that's not the "right" way.
> > You want to use the job file indeed, which should contain all
> dependencies.
> > However, it packages dependencies as jars-in-the-jar, which doesn't
> > work for Hadoop.
> >
> > I think if you modify the Maven build to just repackage all classes
> > into the main jar, it works. It works for me at least.
> >
> > And then I forget why we don't just do that in the build.
> >
> > On Sun, May 8, 2011 at 10:56 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> >> On Sun, May 8, 2011 at 2:01 PM, EDUARDO ANTONIO BUITRAGO ZAPATA <
> >> eduardobuitrago@gmail.com> wrote:
> >>
> >>> you should try copying the mahout-math-(YOURVERSION).jar into the lib
> >>> folder
> >>> in the $HADOOP_HOME.
> >>>
> >>
> >> That works just fine if you're on a cluster where you have privileges to
> do
> >> this.  The mahout-math-0.5-SNAPSHOT.jar
> >> is in the lib directory of the mahout-examples-0.5-SNAPSHOT-job.jar, so
> it
> >> *should* be finding it. :\
> >>
> >>  -jake
> >>
> >>
> >>>
> >>> 2011/5/8 Jake Mannix <ja...@gmail.com>
> >>>
> >>> > Running on the cluster, I'm hit with this again, on using a freshly
> built
> >>> > distribution (with mvn package -Prelease).  What is the solution we
> >>> always
> >>> > give people to deal with this?
> >>> >
> >>> > (just running things like "./bin/mahout svd -i <input> -o <output>
> etc...
> >>> > ")
> >>> >
> >>> >  -jake
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> EDUARDO BUITRAGO
> >>> MSc student in Engineering - Systems and Computing - Universidad de los
> Andes
> >>> Systems Engineer - Universidad Francisco de Paula Santander
> >>> Cisco Certified Network Associate - CCNA
> >>>
> >>
> >
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Benson Margulies <bi...@gmail.com>.
I believe that the bottom of this is the problems with setJarByClass
that were documented some weeks ago.

On Sun, May 8, 2011 at 5:58 PM, Sean Owen <sr...@gmail.com> wrote:
> If I recall the last discussion on this correctly --
>
> No you don't want to put anything in Hadoop's lib/ directory. Even if
> you can, that's not the "right" way.
> You want to use the job file indeed, which should contain all dependencies.
> However, it packages dependencies as jars-in-the-jar, which doesn't
> work for Hadoop.
>
> I think if you modify the Maven build to just repackage all classes
> into the main jar, it works. It works for me at least.
>
> And then I forget why we don't just do that in the build.
>
> On Sun, May 8, 2011 at 10:56 PM, Jake Mannix <ja...@gmail.com> wrote:
>> On Sun, May 8, 2011 at 2:01 PM, EDUARDO ANTONIO BUITRAGO ZAPATA <
>> eduardobuitrago@gmail.com> wrote:
>>
>>> you should try copying the mahout-math-(YOURVERSION).jar into the lib
>>> folder
>>> in the $HADOOP_HOME.
>>>
>>
>> That works just fine if you're on a cluster where you have privileges to do
>> this.  The mahout-math-0.5-SNAPSHOT.jar
>> is in the lib directory of the mahout-examples-0.5-SNAPSHOT-job.jar, so it
>> *should* be finding it. :\
>>
>>  -jake
>>
>>
>>>
>>> 2011/5/8 Jake Mannix <ja...@gmail.com>
>>>
>>> > Running on the cluster, I'm hit with this again, on using a freshly built
>>> > distribution (with mvn package -Prelease).  What is the solution we
>>> always
>>> > give people to deal with this?
>>> >
>>> > (just running things like "./bin/mahout svd -i <input> -o <output> etc...
>>> > ")
>>> >
>>> >  -jake
>>> >
>>>
>>>
>>>
>>> --
>>> EDUARDO BUITRAGO
>>> MSc student in Engineering - Systems and Computing - Universidad de los Andes
>>> Systems Engineer - Universidad Francisco de Paula Santander
>>> Cisco Certified Network Associate - CCNA
>>>
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Sean Owen <sr...@gmail.com>.
If I recall the last discussion on this correctly --

No you don't want to put anything in Hadoop's lib/ directory. Even if
you can, that's not the "right" way.
You want to use the job file indeed, which should contain all dependencies.
However, it packages dependencies as jars-in-the-jar, which doesn't
work for Hadoop.

I think if you modify the Maven build to just repackage all classes
into the main jar, it works. It works for me at least.

And then I forget why we don't just do that in the build.

On Sun, May 8, 2011 at 10:56 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Sun, May 8, 2011 at 2:01 PM, EDUARDO ANTONIO BUITRAGO ZAPATA <
> eduardobuitrago@gmail.com> wrote:
>
>> you should try copying the mahout-math-(YOURVERSION).jar into the lib
>> folder
>> in the $HADOOP_HOME.
>>
>
> That works just fine if you're on a cluster where you have privileges to do
> this.  The mahout-math-0.5-SNAPSHOT.jar
> is in the lib directory of the mahout-examples-0.5-SNAPSHOT-job.jar, so it
> *should* be finding it. :\
>
>  -jake
>
>
>>
>> 2011/5/8 Jake Mannix <ja...@gmail.com>
>>
>> > Running on the cluster, I'm hit with this again, on using a freshly built
>> > distribution (with mvn package -Prelease).  What is the solution we
>> always
>> > give people to deal with this?
>> >
>> > (just running things like "./bin/mahout svd -i <input> -o <output> etc...
>> > ")
>> >
>> >  -jake
>> >
>>
>>
>>
>> --
>> EDUARDO BUITRAGO
>> MSc student in Engineering - Systems and Computing - Universidad de los Andes
>> Systems Engineer - Universidad Francisco de Paula Santander
>> Cisco Certified Network Associate - CCNA
>>
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, May 8, 2011 at 2:01 PM, EDUARDO ANTONIO BUITRAGO ZAPATA <
eduardobuitrago@gmail.com> wrote:

> you should try copying the mahout-math-(YOURVERSION).jar into the lib
> folder
> in the $HADOOP_HOME.
>

That works just fine if you're on a cluster where you have privileges to do
this.  The mahout-math-0.5-SNAPSHOT.jar
is in the lib directory of the mahout-examples-0.5-SNAPSHOT-job.jar, so it
*should* be finding it. :\

  -jake


>
> 2011/5/8 Jake Mannix <ja...@gmail.com>
>
> > Running on the cluster, I'm hit with this again, on using a freshly built
> > distribution (with mvn package -Prelease).  What is the solution we
> always
> > give people to deal with this?
> >
> > (just running things like "./bin/mahout svd -i <input> -o <output> etc...
> > ")
> >
> >  -jake
> >
>
>
>
> --
> EDUARDO BUITRAGO
> MSc student in Engineering - Systems and Computing - Universidad de los Andes
> Systems Engineer - Universidad Francisco de Paula Santander
> Cisco Certified Network Associate - CCNA
>

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

Posted by EDUARDO ANTONIO BUITRAGO ZAPATA <ed...@gmail.com>.
you should try copying the mahout-math-(YOURVERSION).jar into the lib folder
in the $HADOOP_HOME.

2011/5/8 Jake Mannix <ja...@gmail.com>

> Running on the cluster, I'm hit with this again, on using a freshly built
> distribution (with mvn package -Prelease).  What is the solution we always
> give people to deal with this?
>
> (just running things like "./bin/mahout svd -i <input> -o <output> etc...
> ")
>
>  -jake
>



-- 
EDUARDO BUITRAGO
MSc student in Engineering - Systems and Computing - Universidad de los Andes
Systems Engineer - Universidad Francisco de Paula Santander
Cisco Certified Network Associate - CCNA