Posted to user@mahout.apache.org by Jake Mannix <ja...@gmail.com> on 2011/01/04 01:51:32 UTC

LDA in Mahout

Hey all,

  tl;dr Learning about how to work with, improve, and understand our LDA
impl.  If you don't care about dimensional reduction with LDA, stop here.

  Long time, not much see (although the excuse of "new job" starts to get
old after 6 months, so I'll try to stop using that one).  I've been starting
to do some large-scale dimensional reduction tests lately, and as things go
with me, I write a bunch of map-reduce code to do something fancy in one
place (SVD with both rows and columns scaling out to 10^8 or so), and end up
wanting to test against some other state of the art (e.g. LDA), and now I'm
all sidetracked: I want to understand better both a) how LDA works in
general, and b) how our impl scales.

  I've been making some minor changes (in particular, implementing
MAHOUT-458 <https://issues.apache.org/jira/browse/MAHOUT-458> among other
things, which seems to have been closed even though it was never committed,
nor was its functionality ever put into trunk in another patch, from what I
can tell), and noticed that one of the major slowdowns is probably due to
the fact that the p(word|topic) matrix is loaded via side-channel means on
every pass, on every node, and as has been mentioned in previous threads
(for example, Running LDA over
Wikipedia<http://markmail.org/message/ua5hckybpkj3stdl>),
this puts an absolute cap on the size of the possible vocabulary (numTerms *
numTopics * 8 bytes < memory per mapper), in addition to being a huge
slowdown: every mapper must load this data via HDFS after every iteration.
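(To put a number on that cap: with, say, a 1M-term vocabulary and 200 topics,
that's 10^6 * 200 * 8 bytes = 1.6GB of dense doubles per mapper, before the
mapper does any real work.)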

  To improve this, the scalable thing is to load only the columns of
p(word|topic) for the words in a document.  Doing a reduce-side join
(running effectively a transpose operation on the documents together with a
"project onto the word_id" mapper on the word/topic matrix), and doing
inference and parameter update in the reducer would accomplish this, but it
would have to be followed up by another map-reduce pass of an identity
mapper and a log-summing reducer (the old LDAReducer).  It's an extra
shuffle per MR pass, but the data transfer seems like it would be
phenomenally less, and should scale just perfectly now - you would never
hold anything in memory the size of numTerms (or numDocs), and instead be
limited by only docLength * numTopics in memory at the inference step.
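
  Very roughly, the shape of that join might look something like the sketch
below.  This is just to illustrate the shuffle structure - the input formats
and class names here are made up, and the actual inference / parameter-update
math (and the follow-up log-summing pass) is elided:

// Sketch of a reduce-side join keyed on word_id, so a reducer only ever
// sees the p(word|topic) row for words that actually occur in documents.
// Assumed (made-up) text formats:
//   documents:    docId \t wordId:count wordId:count ...
//   topic matrix: wordId \t p(word|topic_0) p(word|topic_1) ...
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordTopicJoin {

  // document side: transpose docs into word_id -> (doc_id, count) pairs
  public static class DocSideMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      for (String wc : parts[1].split(" ")) {
        String[] p = wc.split(":");
        ctx.write(new IntWritable(Integer.parseInt(p[0])),
                  new Text("D\t" + parts[0] + "\t" + p[1]));
      }
    }
  }

  // model side: emit each word's row of p(word|topic), tagged for the join
  public static class TopicSideMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      ctx.write(new IntWritable(Integer.parseInt(parts[0])),
                new Text("T\t" + parts[1]));
    }
  }

  // join: each reduce call sees one word's topic row plus every (doc, count)
  // occurrence of that word; the per-word inference contributions would be
  // computed and emitted here for the second, log-summing pass
  public static class JoinReducer
      extends Reducer<IntWritable, Text, Text, Text> {
    @Override
    protected void reduce(IntWritable wordId, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String topicRow = null;
      List<String> docOccurrences = new ArrayList<String>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("T\t")) {
          topicRow = s.substring(2);
        } else {
          docOccurrences.add(s.substring(2));
        }
      }
      for (String occurrence : docOccurrences) {
        ctx.write(new Text(wordId.get() + "\t" + occurrence),
                  new Text(topicRow == null ? "" : topicRow));
      }
    }
  }
}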

  Thoughts on this, for anyone who's been playing around with the LDA code
(*crickets*...) ?

  Another matter, more on the theoretical side of LDA:  the algorithm
"prefers sparse results", from what I understand of the lore (you're
optimizing under an L_1 constraint instead of L_2, so...), but how can you
make this explicit?  The current implementation has some "noise" factor: a
minimal probability that most words have in any topic.  So when you run
the Reuters example (50k terms, 23k documents, 20 topics) and look at the
actual probabilities (not possible to do on trunk, but I'll put up a patch
shortly), you find that while the probability that topic_0 will generate one
of the top 175 terms is about 50%, the long tail is rather long: the
probability that topic_0 will generate one of the top 1000 terms is up to
83%, and the probability that it will generate one of the top 5000 terms
finally reaches 99.6%.

  This is probably the wrong way to look at what the "true" top terms for a
latent factor are, because they're probabilities, and if the total
dictionary size is 50,000, you'd expect random noise alone to put
p(term|topic) at roughly 1/50,000, right?  In which case the right cutoff would be
"ignore all terms with probabilities less than (say) 10x this value".  Given
this constraint, it looks like on the Reuters data set, each topic ends up
with about 800 terms above this cutoff.  Does this sound right?
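
  (For the curious, getting numbers like these out of a single topic's
p(term|topic) row is just a few lines - roughly the following, in plain Java,
with the loading of the distribution from the LDA state stubbed out:)

import java.util.Arrays;

public class TopicMass {
  public static void main(String[] args) {
    double[] pTermGivenTopic = loadTopicDistribution();  // stub, see below
    int numTerms = pTermGivenTopic.length;

    double[] sorted = pTermGivenTopic.clone();
    Arrays.sort(sorted);                                  // ascending

    // cumulative mass carried by the top-N terms
    for (int topN : new int[] {175, 1000, 5000}) {
      double mass = 0.0;
      for (int i = sorted.length - 1; i >= 0 && i >= sorted.length - topN; i--) {
        mass += sorted[i];
      }
      System.out.printf("top %d terms carry %.3f of the topic's mass%n", topN, mass);
    }

    // terms above a "noise" cutoff of 10x the uniform probability 1/numTerms
    double cutoff = 10.0 / numTerms;
    int aboveCutoff = 0;
    for (double p : sorted) {
      if (p >= cutoff) {
        aboveCutoff++;
      }
    }
    System.out.println(aboveCutoff + " terms above the 10/numTerms cutoff");
  }

  private static double[] loadTopicDistribution() {
    // placeholder: in practice this row of p(term|topic) would come from
    // the LDA state on HDFS
    return new double[0];
  }
}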

  The reason I ask is that, to help with scaling, we really shouldn't be
hanging onto probabilities like 2e-11 - they just don't help with anything,
and our representation at the end of the day (and, more importantly, the
data transferred among mappers and reducers while actually doing the
calculation) would be much smaller.  On top of that, I've been led to
believe by my more-in-the-know-about-LDA colleagues that since LDA is so
L_1-centric, you really can try to throw tons of topics at it and it
doesn't really overfit that much, because it basically just splits the topics
into sub-topics (the topics get more sparse).  You don't necessarily get
much better generalization error, but not necessarily any worse either.
 Has anyone else seen this kind of effect?

  -jake

Re: LDA in Mahout

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Jan 6, 2011 at 1:33 PM, Neal Richter <nr...@gmail.com> wrote:

>
> Have you looked at transductive learning (an algorithm within
> semi-supervised learning)?
>

Yes.

Variants on this are widely used in fraud modeling.  One variant is to
simply train on a small labeled set and then
use the output of that model to train on a much larger set.  This is classic
transduction.

A second variant is to build the model on training data and use that to
simply relabel the training data and build a second model.  This is
different from the first case in that the original data is re-used with the
new labels.  This works well when many cases are mis-marked and good
regularization on the first and second model allows obviously whacky
training data to be excluded.  The second model can then be much simpler
since it isn't being distracted by the goofy training examples.
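
In pseudo-Java, the two variants are roughly the following (against a
made-up Learner interface - nothing Mahout-specific here):

import java.util.ArrayList;
import java.util.List;

public class Transduction {

  // hypothetical learner: train on (features, labels), then label new points
  interface Learner {
    void train(List<double[]> features, List<Integer> labels);
    int predict(double[] features);
  }

  // Variant 1: classic transduction - a model trained on the small labeled
  // set labels the big unlabeled set, and a second model trains on that.
  static Learner variant1(Learner first, Learner second,
                          List<double[]> labeledX, List<Integer> labeledY,
                          List<double[]> unlabeledX) {
    first.train(labeledX, labeledY);
    List<Integer> pseudoLabels = new ArrayList<Integer>();
    for (double[] x : unlabeledX) {
      pseudoLabels.add(first.predict(x));
    }
    second.train(unlabeledX, pseudoLabels);
    return second;
  }

  // Variant 2: relabel the training data itself with a well-regularized
  // first model, then fit a simpler second model to the relabeled data so
  // the obviously mis-marked examples stop dominating.
  static Learner variant2(Learner first, Learner second,
                          List<double[]> trainX, List<Integer> trainY) {
    first.train(trainX, trainY);
    List<Integer> relabeled = new ArrayList<Integer>();
    for (double[] x : trainX) {
      relabeled.add(first.predict(x));
    }
    second.train(trainX, relabeled);
    return second;
  }
}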


> IMO it would be very interesting to see what degree a bit of human labeled
> data would improve LDA topic extraction.
>

Definitely interesting.  I have no opinion about useful.


>
> Essentially one can take a large body of unlabeled documents and augment
> with a smaller set of labeled documents. Under certain conditions the
> addition can greatly boost the accuracy of the assigned labels/topics.
>

Yes.  Definitely could help.

Re: LDA in Mahout

Posted by Neal Richter <nr...@gmail.com>.
Yes.. that is a good way to normalize and differentiate/interpret the dot
product or simple intersection-set-count.

Have you looked at transductive learning (an algorithm within
semi-supervised learning)?

IMO it would be very interesting to see to what degree a bit of human-labeled
data would improve LDA topic extraction.

Essentially one can take a large body of unlabeled documents and augment it
with a smaller set of labeled documents. Under certain conditions the
addition can greatly boost the accuracy of the assigned labels/topics.

See the alg pseudo code here:
http://aicoder.blogspot.com/2010/10/review-of-to-rank-with-partially.html

-Neal

On Thu, Jan 6, 2011 at 2:20 PM, Ted Dunning <te...@gmail.com> wrote:

> The only reasonable quick and dirty test of this sort is to look at the
> terms most related to each topic and heuristically assign human tags.
>
> Unless...
>
> Perhaps I misunderstood you in the first place.  If you include tags in the
> LDA training, then you can look at distance (aka 1-dot product) in LDA
> space
> between tags versus as a predictor of how often the tags cooccur.
>  Alternatively, you can look at dot product between test documents and the
> tags that are on the test document.  Then you can define AUC as the
> probability that tags that are actually present have higher dot product
> than
> randomly selected tags.  Higher AUC is good.
>
> On Thu, Jan 6, 2011 at 1:03 PM, Neal Richter <nr...@gmail.com> wrote:
>
> > I did not intent to propose a theoretically sound way to test LDA as an
> > extractor/labeler of human tags.  The intent was simple suggestion
> towards
> > doing a quick-n-dirty test to see what the overlap of LDA extracted
> topics
> > and human tags on a well tagged document set.
> >
>

Re: LDA in Mahout

Posted by Ted Dunning <te...@gmail.com>.
The only reasonable quick and dirty test of this sort is to look at the
terms most related to each topic and heuristically assign human tags.

Unless...

Perhaps I misunderstood you in the first place.  If you include tags in the
LDA training, then you can look at distance (aka 1 - dot product) in LDA space
between tags as a predictor of how often the tags cooccur.
 Alternatively, you can look at dot product between test documents and the
tags that are on the test document.  Then you can define AUC as the
probability that tags that are actually present have higher dot product than
randomly selected tags.  Higher AUC is good.
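
Sketched out (all of the names here are hypothetical - it just assumes the
documents and the tags have already been folded into the same topic space):

import java.util.List;
import java.util.Random;

public class TagAuc {

  static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }

  // AUC estimated by sampling: how often does a tag actually on a document
  // score a higher dot product with that document than a random tag does?
  static double sampledAuc(double[][] docTopics, List<List<Integer>> docTags,
                           double[][] tagTopics, int samples, Random rnd) {
    int wins = 0;
    int ties = 0;
    for (int s = 0; s < samples; s++) {
      int d = rnd.nextInt(docTopics.length);
      List<Integer> tags = docTags.get(d);
      if (tags.isEmpty()) {
        s--;                      // resample if this document has no tags
        continue;
      }
      int presentTag = tags.get(rnd.nextInt(tags.size()));
      int randomTag = rnd.nextInt(tagTopics.length);
      double dPresent = dot(docTopics[d], tagTopics[presentTag]);
      double dRandom = dot(docTopics[d], tagTopics[randomTag]);
      if (dPresent > dRandom) {
        wins++;
      } else if (dPresent == dRandom) {
        ties++;
      }
    }
    return (wins + 0.5 * ties) / samples;
  }
}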

On Thu, Jan 6, 2011 at 1:03 PM, Neal Richter <nr...@gmail.com> wrote:

> I did not intent to propose a theoretically sound way to test LDA as an
> extractor/labeler of human tags.  The intent was simple suggestion towards
> doing a quick-n-dirty test to see what the overlap of LDA extracted topics
> and human tags on a well tagged document set.
>

Re: LDA in Mahout

Posted by Neal Richter <nr...@gmail.com>.
>
>
> My point is exactly that this evaluation will lead to nonsense.  The size
> of
> the extracted topics vector isn't even necessarily the same as the size of
> the labels vector.  There is also no guarantee that it would be in the same
> order.
>
>
Order is not important in the comparison.  I'm proposing a simple
metric that is NOT great from a theory perspective.

Intersection(Document.LabelsVector, Document.ExtractedTopicsVector).Count()
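
Something like this, say, with both sides flattened to plain sets of term
strings (purely illustrative):

import java.util.HashSet;
import java.util.Set;

public class LabelOverlap {
  // quick-n-dirty: count of terms appearing in both the human labels and
  // the top terms of the extracted topics
  public static int overlapCount(Set<String> labels, Set<String> extractedTopicTerms) {
    Set<String> common = new HashSet<String>(labels);
    common.retainAll(extractedTopicTerms);
    return common.size();
  }
}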



> What you need is one extra step where you build a supervised classifier
> using the extracted topics vector to predict the label.  The accuracy of
> this supervised classifier is a measure of how well the extracted topics
> encodes the information in the labels.
>

Why not mix it in and perform transductive learning then?

I did not intend to propose a theoretically sound way to test LDA as an
extractor/labeler of human tags.  The intent was a simple suggestion towards
doing a quick-n-dirty test to see what the overlap of LDA-extracted topics
and human tags is on a well-tagged document set.

- Neal

Re: LDA in Mahout

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Jan 6, 2011 at 11:08 AM, Neal Richter <nr...@gmail.com> wrote:

> > That said, your suggestion is a reasonable one.  If you use the LDA topic
> > distribution for each document as a feature vector for a supervised model
> > then it is pretty easy to argue that LDA distributions that give better
> > model performance are better at capturing content.  The supervised step
> is
> > necessary, however, since there is not guarantee that the LDA topics will
> > have a simple relationship to human assigned categories.
> >
>
> If one thinks of the LDA outputting a distribution of topics for a given
> document... then at some point a real decision is made to output N topic
> labels... it looks like a classifier now.
>

It is true that anything that takes features in and produces scores can be
called a classifier.  The key distinction is whether the model is derived in
a supervised or unsupervised fashion.  Models derived by supervised training
can be evaluated by holding out training data.  It is common (and sloppy) to
refer to models derived with supervised learning as classifiers and models
derived with unsupervised learning as clustering.  In this nomenclature, LDA
is a clustering algorithm.

Models derived by unsupervised training cannot be evaluated against held out
data that has assigned labels because there is no reason that the
unsupervised results should correlate in an obvious way to the desired
labels.

There are various figures of merit for unsupervised models, but the one that
I prefer is "how useful is the output of the unsupervised model?". The
simplest model of utility lying around in this case is whether the
unsupervised model produces features that can be used to build a supervised
model.  An unsupervised model that cannot be so used is not providing us
with usable information.  Hopefully, the supervised model is very simple,
possibly even just a one-to-one rearrangement of unsupervised scores.




>
> I'm suggesting that one can do a classification accuracy test of the LDA
> predicted label set with a set of human generated labels from tagging data.
>
> 1) Document.DataVector
> 2) Document.LabelsVector
> 3) Run LDA on Document.DataVector to generate
> Document.ExtractedTopicsVector
>
> Compute accuracy by comparing Document.LabelsVector
> to Document.ExtractedTopicsVector
>

My point is exactly that this evaluation will lead to nonsense.  The size of
the extracted topics vector isn't even necessarily the same as the size of
the labels vector.  There is also no guarantee that it would be in the same
order.


>
> There will be misses if the human labeled/tagged term or phrase does not
> exist within the document's text or metadata. LDA can't see these unless
> some augmentation/inference step is run on the document vector prior to LDA
> input.
>

Actually, you will likely get worse than random results.

What you need is one extra step where you build a supervised classifier
using the extracted topics vector to predict the label.  The accuracy of
this supervised classifier is a measure of how well the extracted topics
encodes the information in the labels.
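
As a sketch of that extra step (the Mahout SGD classifier calls here are from
memory, so treat them as approximate, and the tag set is simplified to a
single label per document): take p(topic|doc) as the feature vector, the human
label as the target, train, and measure held-out accuracy.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class TopicsToLabels {

  // train on (topic distribution -> label), score on held-out documents
  public static double heldOutAccuracy(double[][] trainTopics, int[] trainLabels,
                                       double[][] testTopics, int[] testLabels,
                                       int numLabels, int numTopics) {
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(numLabels, numTopics, new L1());
    for (int pass = 0; pass < 10; pass++) {
      for (int i = 0; i < trainTopics.length; i++) {
        learner.train(trainLabels[i], new DenseVector(trainTopics[i]));
      }
    }
    int correct = 0;
    for (int i = 0; i < testTopics.length; i++) {
      Vector scores = learner.classifyFull(new DenseVector(testTopics[i]));
      int best = 0;
      for (int k = 1; k < scores.size(); k++) {
        if (scores.get(k) > scores.get(best)) {
          best = k;
        }
      }
      if (best == testLabels[i]) {
        correct++;
      }
    }
    return correct / (double) testTopics.length;
  }
}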

Re: LDA in Mahout

Posted by Neal Richter <nr...@gmail.com>.
On Thu, Jan 6, 2011 at 9:22 AM, Ted Dunning <te...@gmail.com> wrote:

> The topics in LDA are not the same as topics in normal parlance.  They are
> abstract, internal probabilistic distributions.
>
>
Yes.. it also depends on the LDA-extracted topics existing in the document's
text or meta-data.


> That said, your suggestion is a reasonable one.  If you use the LDA topic
> distribution for each document as a feature vector for a supervised model
> then it is pretty easy to argue that LDA distributions that give better
> model performance are better at capturing content.  The supervised step is
> necessary, however, since there is not guarantee that the LDA topics will
> have a simple relationship to human assigned categories.
>

If one thinks of the LDA outputting a distribution of topics for a given
document... then at some point a real decision is made to output N topic
labels... it looks like a classifier now.

I'm suggesting that one can do a classification accuracy test of the LDA
predicted label set with a set of human generated labels from tagging data.

1) Document.DataVector
2) Document.LabelsVector
3) Run LDA on Document.DataVector to generate Document.ExtractedTopicsVector

Compute accuracy by comparing Document.LabelsVector
to Document.ExtractedTopicsVector

There will be misses if the human labeled/tagged term or phrase does not
exist within the document's text or metadata. LDA can't see these unless
some augmentation/inference step is run on the document vector prior to LDA
input.

So it's not really supervised as there is no training.... just the 2nd-stage
testing part of supervised learning.

- Neal



>
> On Wed, Jan 5, 2011 at 11:57 PM, Neal Richter <nr...@gmail.com> wrote:
>
> > What about gauging it's ability to predict the topics of labeled data?
> >
> > 1) Grab RSS feeds of blog posts and use the tags as labels
> > 2) Delicious bookmarks & their content versus user tags
> > 3) other examples abound...
> >
> > On Tue, Jan 4, 2011 at 10:33 AM, Jake Mannix <ja...@gmail.com>
> > wrote:
> >
> > > Saying we have hashing is different than saying we know what will
> happen
> > to
> > > an algorithm once its running over hashed features (as the continuing
> > work
> > > on our Stochastic SVD demonstrates).
> > >
> > > I can certainly try to run LDA over a hashed vector set, but I'm not
> sure
> > > what criteria for correctness / quality of the topic model I should use
> > if
> > > I
> > > do.
> > >
> > >  -jake
> > >
> > > On Jan 4, 2011 7:21 AM, "Robin Anil" <ro...@gmail.com> wrote:
> > >
> > > We already have the second part - the hashing trick. Thanks to Ted, and
> > he
> > > has a mechanism to partially reverse engineer the feature as well. You
> > > might
> > > be able to drop it directly in the job itself or even vectorize and
> then
> > > run
> > > LDA.
> > >
> > > Robin
> > >
> > > On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> > > >
> > > Hey Robin, > > Vowp...
> > >
> >
>

Re: LDA in Mahout

Posted by Ted Dunning <te...@gmail.com>.
The topics in LDA are not the same as topics in normal parlance.  They are
abstract, internal probabilistic distributions.

That said, your suggestion is a reasonable one.  If you use the LDA topic
distribution for each document as a feature vector for a supervised model
then it is pretty easy to argue that LDA distributions that give better
model performance are better at capturing content.  The supervised step is
necessary, however, since there is no guarantee that the LDA topics will
have a simple relationship to human assigned categories.

On Wed, Jan 5, 2011 at 11:57 PM, Neal Richter <nr...@gmail.com> wrote:

> What about gauging it's ability to predict the topics of labeled data?
>
> 1) Grab RSS feeds of blog posts and use the tags as labels
> 2) Delicious bookmarks & their content versus user tags
> 3) other examples abound...
>
> On Tue, Jan 4, 2011 at 10:33 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Saying we have hashing is different than saying we know what will happen
> to
> > an algorithm once its running over hashed features (as the continuing
> work
> > on our Stochastic SVD demonstrates).
> >
> > I can certainly try to run LDA over a hashed vector set, but I'm not sure
> > what criteria for correctness / quality of the topic model I should use
> if
> > I
> > do.
> >
> >  -jake
> >
> > On Jan 4, 2011 7:21 AM, "Robin Anil" <ro...@gmail.com> wrote:
> >
> > We already have the second part - the hashing trick. Thanks to Ted, and
> he
> > has a mechanism to partially reverse engineer the feature as well. You
> > might
> > be able to drop it directly in the job itself or even vectorize and then
> > run
> > LDA.
> >
> > Robin
> >
> > On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > >
> > Hey Robin, > > Vowp...
> >
>

Re: LDA in Mahout

Posted by Neal Richter <nr...@gmail.com>.
What about gauging its ability to predict the topics of labeled data?

1) Grab RSS feeds of blog posts and use the tags as labels
2) Delicious bookmarks & their content versus user tags
3) other examples abound...

On Tue, Jan 4, 2011 at 10:33 AM, Jake Mannix <ja...@gmail.com> wrote:

> Saying we have hashing is different than saying we know what will happen to
> an algorithm once its running over hashed features (as the continuing work
> on our Stochastic SVD demonstrates).
>
> I can certainly try to run LDA over a hashed vector set, but I'm not sure
> what criteria for correctness / quality of the topic model I should use if
> I
> do.
>
>  -jake
>
> On Jan 4, 2011 7:21 AM, "Robin Anil" <ro...@gmail.com> wrote:
>
> We already have the second part - the hashing trick. Thanks to Ted, and he
> has a mechanism to partially reverse engineer the feature as well. You
> might
> be able to drop it directly in the job itself or even vectorize and then
> run
> LDA.
>
> Robin
>
> On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <ja...@gmail.com> wrote:
> >
> Hey Robin, > > Vowp...
>

Re: Using several Mahout JarSteps in a JobFlow

Posted by Thomas Söhngen <th...@beluto.com>.
Hi Sebastian,

thank you very much, using the tempDir parameter fixed the problem.
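
For anyone finding this thread later, I just added an explicit --tempDir to
each step's Args, along these lines (the actual temp path here is only an
example):

       "Args": [
          "--input",     "s3n://recommendertest/data/<jobid>/aggregateWatched/",
          "--output",    "s3n://recommendertest/data/<jobid>/userRecommendations/",
          "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
          "--numRecommendations",    "100",
          "--tempDir",    "s3n://recommendertest/data/<jobid>/temp/step3/"
       ]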

As you mentioned, it would be really nice if there were a single step 
which puts out item recommendations for users as well as user-user and 
item-item similarity. An alternative would be to split the 
RecommenderJob class into different jobs which rely on each other's 
output. This would be even better for my case, because I am using AWS 
EMR and would have to do a manual copy out of HDFS if this information 
is not in the main output of the step, which would be much harder to 
script.

Best regards,
Thomas

Am 08.02.2011 17:46, schrieb Sebastian Schelter:
> Hi Thomas,
>
> you can also use the parameter --tempDir to explicitly point a job to a
> temp directory.
>
> By the way I recoginize that our users shouldn't need to execute both
> jobs like you do because the similar items computation is already
> contained in RecommenderJob, we should add an option that makes it write
> out the similar items in a nice form, so we can avoid having to run both
> jobs.
>
> I'm gonna create a ticket for this.
>
> --sebastian
>
>
> Am 08.02.2011 17:37, schrieb Sean Owen:
>> I would not run them in the same root directory / key prefix. Put them
>> both under different namespaces.
>>
>> On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen<th...@beluto.com>  wrote:
>>> Hi fellow data crunchers,
>>>
>>> I am running a JobFlow with a step using
>>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
>>> following step using
>>> "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
>>> works without problems, but the second one is throwing an Exception:
>>>
>>> |Exception in thread"main"
>>>   org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
>>> temp/itemIDIndex already exists and is not empty
>>>         at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
>>>         at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>>         at
>>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at
>>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>> |
>>>
>>> It looks like the second job is using the same temporal output directories
>>> as the first job. How can I avoid this? Or even better: If some of the tasks
>>> are already done and cached in the first step, how could I use them so that
>>> they don't have to be recomputed in the second step?
>>>
>>> Best regards,
>>> Thomas
>>>
>>> PS: This is the actual JobFlow definition in JSON:
>>>
>>> [
>>>    [......],
>>>   {
>>>     "Name": "MR Step 2: Find similiar items",
>>>     "HadoopJarStep": {
>>>       "MainClass":
>>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
>>>       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>>       "Args": [
>>>          "--input",
>>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>>          "--output",    "s3n://recommendertest/data/<jobid>/similiarItems/",
>>>          "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>>          "--maxSimilaritiesPerItem",    "100"
>>>       ]
>>>     }
>>>   },
>>>   {
>>>     "Name": "MR Step 3: Find items for user",
>>>     "HadoopJarStep": {
>>>       "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
>>>       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>>       "Args": [
>>>          "--input",
>>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>>          "--output",
>>>   "s3n://recommendertest/data/<jobid>/userRecommendations/",
>>>          "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>>          "--numRecommendations",    "100"
>>>       ]
>>>     }
>>>   }
>>> ]
>>>
>>> ||||
>>>
>>>

Re: Using several Mahout JarSteps in a JobFlow

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Thomas,

you can also use the parameter --tempDir to explicitly point a job to a
temp directory.

By the way, I recognize that our users shouldn't need to execute both
jobs like you do, because the similar-items computation is already
contained in RecommenderJob. We should add an option that makes it write
out the similar items in a nice form, so we can avoid having to run both
jobs.

I'm gonna create a ticket for this.

--sebastian


Am 08.02.2011 17:37, schrieb Sean Owen:
> I would not run them in the same root directory / key prefix. Put them
> both under different namespaces.
> 
> On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen <th...@beluto.com> wrote:
>> Hi fellow data crunchers,
>>
>> I am running a JobFlow with a step using
>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
>> following step using
>> "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
>> works without problems, but the second one is throwing an Exception:
>>
>> |Exception in thread"main"
>>  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
>> temp/itemIDIndex already exists and is not empty
>>        at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
>>        at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
>>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>        at
>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at
>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>> |
>>
>> It looks like the second job is using the same temporal output directories
>> as the first job. How can I avoid this? Or even better: If some of the tasks
>> are already done and cached in the first step, how could I use them so that
>> they don't have to be recomputed in the second step?
>>
>> Best regards,
>> Thomas
>>
>> PS: This is the actual JobFlow definition in JSON:
>>
>> [
>>   [......],
>>  {
>>    "Name": "MR Step 2: Find similiar items",
>>    "HadoopJarStep": {
>>      "MainClass":
>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
>>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>      "Args": [
>>         "--input",
>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>         "--output",    "s3n://recommendertest/data/<jobid>/similiarItems/",
>>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>         "--maxSimilaritiesPerItem",    "100"
>>      ]
>>    }
>>  },
>>  {
>>    "Name": "MR Step 3: Find items for user",
>>    "HadoopJarStep": {
>>      "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
>>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>      "Args": [
>>         "--input",
>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>         "--output",
>>  "s3n://recommendertest/data/<jobid>/userRecommendations/",
>>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>         "--numRecommendations",    "100"
>>      ]
>>    }
>>  }
>> ]
>>
>> ||||
>>
>>


Re: Using several Mahout JarSteps in a JobFlow

Posted by Sean Owen <sr...@gmail.com>.
I would not run them in the same root directory / key prefix. Put them
both under different namespaces.

On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen <th...@beluto.com> wrote:
> Hi fellow data crunchers,
>
> I am running a JobFlow with a step using
> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
> following step using
> "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
> works without problems, but the second one is throwing an Exception:
>
> |Exception in thread"main"
>  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> temp/itemIDIndex already exists and is not empty
>        at
> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
>        at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>        at
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> |
>
> It looks like the second job is using the same temporal output directories
> as the first job. How can I avoid this? Or even better: If some of the tasks
> are already done and cached in the first step, how could I use them so that
> they don't have to be recomputed in the second step?
>
> Best regards,
> Thomas
>
> PS: This is the actual JobFlow definition in JSON:
>
> [
>   [......],
>  {
>    "Name": "MR Step 2: Find similiar items",
>    "HadoopJarStep": {
>      "MainClass":
> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>      "Args": [
>         "--input",
> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>         "--output",    "s3n://recommendertest/data/<jobid>/similiarItems/",
>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>         "--maxSimilaritiesPerItem",    "100"
>      ]
>    }
>  },
>  {
>    "Name": "MR Step 3: Find items for user",
>    "HadoopJarStep": {
>      "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>      "Args": [
>         "--input",
> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>         "--output",
>  "s3n://recommendertest/data/<jobid>/userRecommendations/",
>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>         "--numRecommendations",    "100"
>      ]
>    }
>  }
> ]
>
> ||||
>
>

Using several Mahout JarSteps in a JobFlow

Posted by Thomas Söhngen <th...@beluto.com>.
Hi fellow data crunchers,

I am running a JobFlow with a step using 
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" 
and a following step using 
"org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step 
works without problems, but the second one is throwing an Exception:

Exception in thread "main"  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/itemIDIndex already exists and is not empty
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
	at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


It looks like the second job is using the same temporary output 
directories as the first job. How can I avoid this? Or even better: If 
some of the tasks are already done and cached in the first step, how 
could I use them so that they don't have to be recomputed in the second 
step?

Best regards,
Thomas

PS: This is the actual JobFlow definition in JSON:

[
    [......],
   {
     "Name": "MR Step 2: Find similiar items",
     "HadoopJarStep": {
       "MainClass": 
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
       "Args": [
          "--input",     
"s3n://recommendertest/data/<jobid>/aggregateWatched/",
          "--output",    
"s3n://recommendertest/data/<jobid>/similiarItems/",
          "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
          "--maxSimilaritiesPerItem",    "100"
       ]
     }
   },
   {
     "Name": "MR Step 3: Find items for user",
     "HadoopJarStep": {
       "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
       "Args": [
          "--input",     
"s3n://recommendertest/data/<jobid>/aggregateWatched/",
          "--output",    
"s3n://recommendertest/data/<jobid>/userRecommendations/",
          "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
          "--numRecommendations",    "100"
       ]
     }
   }
]



Re: LDA in Mahout

Posted by Ted Dunning <te...@gmail.com>.
I agree here.  Perplexity is probably the best measure of whether LDA is
still capturing the information it needs.
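
For reference, held-out perplexity is just the exponential of the negative
per-token average log likelihood - a sketch, assuming the model already gives
you a log likelihood per held-out document:

public class Perplexity {
  // perplexity = exp( - sum_d log p(w_d) / sum_d N_d )
  public static double perplexity(double[] docLogLikelihoods, int[] docTokenCounts) {
    double totalLogLikelihood = 0.0;
    long totalTokens = 0;
    for (int d = 0; d < docLogLikelihoods.length; d++) {
      totalLogLikelihood += docLogLikelihoods[d];
      totalTokens += docTokenCounts[d];
    }
    return Math.exp(-totalLogLikelihood / totalTokens);
  }
}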

On Thu, Feb 3, 2011 at 8:58 AM, Federico Castanedo <fc...@inf.uc3m.es>wrote:

> Hi,
>
> Joined a bit late this discussion, but, what about the perplexity measure
> as
> reported on section 7.1. of Blei's LDA paper. it seems to be the metric
> which is commonly used to obtain the best value of "k" (topics) when
> training a LDA model.
>
> bests,
> Federico
>
> 2011/1/4 Jake Mannix <ja...@gmail.com>
>
> > Saying we have hashing is different than saying we know what will happen
> to
> > an algorithm once its running over hashed features (as the continuing
> work
> > on our Stochastic SVD demonstrates).
> >
> > I can certainly try to run LDA over a hashed vector set, but I'm not sure
> > what criteria for correctness / quality of the topic model I should use
> if
> > I
> > do.
> >
> >  -jake
> >
> > On Jan 4, 2011 7:21 AM, "Robin Anil" <ro...@gmail.com> wrote:
> >
> > We already have the second part - the hashing trick. Thanks to Ted, and
> he
> > has a mechanism to partially reverse engineer the feature as well. You
> > might
> > be able to drop it directly in the job itself or even vectorize and then
> > run
> > LDA.
> >
> > Robin
> >
> > On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > >
> > Hey Robin, > > Vowp...
> >
>

Re: LDA in Mahout

Posted by Federico Castanedo <fc...@inf.uc3m.es>.
Hi,

I joined this discussion a bit late, but what about the perplexity measure as
reported in section 7.1 of Blei's LDA paper?  It seems to be the metric
which is commonly used to obtain the best value of "k" (topics) when
training an LDA model.

bests,
Federico

2011/1/4 Jake Mannix <ja...@gmail.com>

> Saying we have hashing is different than saying we know what will happen to
> an algorithm once its running over hashed features (as the continuing work
> on our Stochastic SVD demonstrates).
>
> I can certainly try to run LDA over a hashed vector set, but I'm not sure
> what criteria for correctness / quality of the topic model I should use if
> I
> do.
>
>  -jake
>
> On Jan 4, 2011 7:21 AM, "Robin Anil" <ro...@gmail.com> wrote:
>
> We already have the second part - the hashing trick. Thanks to Ted, and he
> has a mechanism to partially reverse engineer the feature as well. You
> might
> be able to drop it directly in the job itself or even vectorize and then
> run
> LDA.
>
> Robin
>
> On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <ja...@gmail.com> wrote:
> >
> Hey Robin, > > Vowp...
>

Re: LDA in Mahout

Posted by Jake Mannix <ja...@gmail.com>.
Saying we have hashing is different than saying we know what will happen to
an algorithm once its running over hashed features (as the continuing work
on our Stochastic SVD demonstrates).

I can certainly try to run LDA over a hashed vector set, but I'm not sure
what criteria for correctness / quality of the topic model I should use if I
do.

  -jake

On Jan 4, 2011 7:21 AM, "Robin Anil" <ro...@gmail.com> wrote:

We already have the second part - the hashing trick. Thanks to Ted, and he
has a mechanism to partially reverse engineer the feature as well. You might
be able to drop it directly in the job itself or even vectorize and then run
LDA.

Robin

On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <ja...@gmail.com> wrote: >
Hey Robin, > > Vowp...

Re: LDA in Mahout

Posted by Robin Anil <ro...@gmail.com>.
We already have the second part - the hashing trick. Thanks to Ted, and he
has a mechanism to partially reverse engineer the feature as well. You might
be able to drop it directly in the job itself or even vectorize and then run
LDA.
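
The core of the trick is tiny - simplified sketch below (not the actual
Mahout encoder, which does more than this): hash each term straight to a
fixed-size feature index, so no term dictionary ever has to be held in
memory.

public class HashedVectorizer {
  // map terms to a fixed number of features by hashing; colliding terms
  // simply add their counts together
  public static double[] vectorize(Iterable<String> terms, int numFeatures) {
    double[] vector = new double[numFeatures];
    for (String term : terms) {
      int index = (term.hashCode() & Integer.MAX_VALUE) % numFeatures;
      vector[index] += 1.0;
    }
    return vector;
  }
}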

Robin


On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <ja...@gmail.com> wrote:

> Hey Robin,
>
>  Vowpal Wabbit is scalable for numDocs by being a streaming system, and
> scalable for numFeatures by using hashing, and for time by being blazingly
> fast.
>
>   I'm unfortunately just a novice LDA coder, so my attempts around
> deciphering VW's LDA impl (to see if there is anything we can learn from it
> which we aren't doing yet) have been... slow.
>
>  One thing we could do is write a streaming form of our current MR LDA, and
> see at what scale it actually starts to help.
>
>  -jake
>
> On Jan 3, 2011 9:33 PM, "Robin Anil" <ro...@gmail.com> wrote:
>
> Jake, take a look at Vowpal Wabbit 5.0. I saw an incremental LDA
> implementation there. Might be scalable
>
> On Tue, Jan 4, 2011 at 6:21 AM, Jake Mannix <ja...@gmail.com> wrote:
> >
> Hey all, > > tl;dr ...
> > MAHOUT-458 <https://issues.apache.org/jira/browse/MAHOUT-458> among
> other
>
> > things, which seems to have been closed even though it was never
> committed, > nor was its function...
> > Wikipedia<http://markmail.org/message/ua5hckybpkj3stdl>),
>
> > this puts an absolute cap on the size of the possible vocabulary
> (numTerms
> > * > numTopics * 8byte...
>

Re: LDA in Mahout

Posted by Jake Mannix <ja...@gmail.com>.
Hey Robin,

  Vowpal Wabbit is scalable for numDocs by being a streaming system, and
scalable for numFeatures by using hashing, and for time by being blazingly
fast.

   I'm unfortunately just a novice LDA coder, so my attempts around
deciphering VW's LDA impl (to see if there is anything we can learn from it
which we aren't doing yet) have been... slow.

  One thing we could do is write a streaming form of our current MR LDA, and
see at what scale it actually starts to help.

  -jake

On Jan 3, 2011 9:33 PM, "Robin Anil" <ro...@gmail.com> wrote:

Jake, take a look at Vowpal Wabbit 5.0. I saw an incremental LDA
implementation there. Might be scalable

On Tue, Jan 4, 2011 at 6:21 AM, Jake Mannix <ja...@gmail.com> wrote: >
Hey all, > > tl;dr ...
> MAHOUT-458 <https://issues.apache.org/jira/browse/MAHOUT-458> among other

> things, which seems to have been closed even though it was never
committed, > nor was its function...
> Wikipedia<http://markmail.org/message/ua5hckybpkj3stdl>),

> this puts an absolute cap on the size of the possible vocabulary (numTerms
> * > numTopics * 8byte...

Re: LDA in Mahout

Posted by Robin Anil <ro...@gmail.com>.
Jake, take a look at Vowpal Wabbit 5.0. I saw an incremental LDA
implementation there. Might be scalable


On Tue, Jan 4, 2011 at 6:21 AM, Jake Mannix <ja...@gmail.com> wrote:

> Hey all,
>
>  tl;dr Learning about how to work with, improve, and understand our LDA
> impl.  If you don't care about dimensional reduction with LDA, stop here.
>
>  Long time, not much see (although the excuse of "new job" starts to get
> old after 6 months, so I'll try to stop using that one).  I've been
> starting
> to do some large-scale dimensional reduction tests lately, and as things go
> with me, I write a bunch of map-reduce code to do something fancy in one
> place (SVD with both rows and columns scaling out to 10^8 or so), and end
> up
> wanting to test against some other state of the art (e.g. LDA), and now I'm
> all sidetracked: I want to understand better both a) how LDA works in
> general, and how our impl scales.
>
>  I've been making some minor changes (in particular, implementing
> MAHOUT-458 <https://issues.apache.org/jira/browse/MAHOUT-458> among other
> things, which seems to have been closed even though it was never committed,
> nor was its functionality ever put into trunk in another patch, from what I
> can tell), and noticed that one of the major slowdowns is probably due to
> the fact that the p(word|topic) matrix is loaded via side-channel means on
> every pass, on every node, and as has been mentioned in previous threads
> (for example, Running LDA over
> Wikipedia<http://markmail.org/message/ua5hckybpkj3stdl>),
> this puts an absolute cap on the size of the possible vocabulary (numTerms
> *
> numTopics * 8bytes  < memory per mapper), in addition to being a huge
> slowdown: every mapper must load this data via HDFS after every iteration.
>
>  To improve this, the scalable thing is to load only the columns of
> p(word|topic) for the words in a document.  Doing a reduce-side join
> (running effectively a transpose operation on the documents together with a
> "project onto the word_id" mapper on the word/topic matrix), and doing
> inference and parameter update in the reducer would accomplish this, but it
> would have to be followed up by another map-reduce pass of an identity
> mapper and a log-summing reducer (the old LDAReducer).  It's an extra
> shuffle per MR pass, but the data transfer seems like it would be
> phenomenally less, and should scale just perfectly now - you would never
> hold anything in memory the size of numTerms (or numDocs), and instead be
> limited by only docLength * numTopics in memory at the inference step.
>
>  Thoughts on this, for anyone who's been playing around with the LDA code
> (*crickets*...) ?
>
>  Another matter, more on the theoretical side of LDA:  the algorithm
> "prefers sparse results", from what I understand of the lore (you're
> optimizing under an L_1 constraint instead of L_2, so...), but how can you
> make this explicit?  The current implementation has some "noise" factor of
> a
> minimal probability that most words have in any topic, and so when you run
> the Reuters example (50k terms, 23k documents, 20 topics) and look at the
> actual probabilities (not possible to do on trunk, but I'll put up a patch
> shortly), is that that while the probability that topic_0 will generate one
> of the top 175 terms is about 50%, the long tail is rather long: the
> probability that topic_0 will generate one of the top 1000 terms is up to
> 83%, and the probability that topic_0 will generate one of the top 5000 is
> finally 99.6%.
>
>  This is probably the wrong way to look at what the "true" top terms for a
> latent factor are, because they're probabilities, and if the total
> dictionary size is 50,000, you'd expect that random noise would say
> p(term|topic) >= 1/50,000, right?  In which case the right cutoff would be
> "ignore all terms with probabilities less than (say) 10x this value".
>  Given
> this constraint, it looks like on the Reuters data set, each topic ends up
> with about 800 terms above this cutoff.  Does this sound right?
>
>  The reason I ask, is that to help with scaling, we really shouldn't be
> hanging onto probabilities like 2e-11 - they just don't help with anything,
> and our representation at the end of the day, and more importantly: during
> data transfer among mappers and reducers while actually doing the
> calculation, would be much reduced.  More importantly, I've been lead to
> believe by my more-in-the-know-about-LDA colleagues, that since LDA is so
> L_1-centric, you really can try to throw tons of topics at it, and it
> doesn't really overfit that much, because it just splits up the topics into
> sub-topics, basically (the topics get more sparse).  You don't necessarily
> get much better generalization errors, but not necessarily any worse
> either.
>  Has anyone else seen this kind of effect?
>
>  -jake
>