Posted to user@mahout.apache.org by Federico Castanedo <fc...@inf.uc3m.es> on 2011/02/03 17:58:22 UTC

Re: LDA in Mahout

Hi,

I joined this discussion a bit late, but what about the perplexity measure
reported in section 7.1 of Blei's LDA paper? It seems to be the metric
commonly used to pick the best value of k (number of topics) when
training an LDA model.
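
For concreteness, a rough sketch of that model-selection loop in plain
Java (TopicModel and logLikelihood() are hypothetical stand-ins for
whatever your trained model actually exposes):

// Held-out perplexity, as in section 7.1 of Blei et al.:
//   perplexity(D_test) = exp( - sum_d log p(w_d) / sum_d N_d )
// Lower is better; train one model per candidate k and keep the best.
import java.util.List;

public class PerplexityModelSelection {

  /** Hypothetical stand-in for whatever a trained LDA model exposes. */
  interface TopicModel {
    double logLikelihood(int[] heldOutDoc); // log p(w_d) for one document
  }

  static double perplexity(List<int[]> heldOutDocs, TopicModel model) {
    double logLikelihoodSum = 0.0;
    long tokenCount = 0;
    for (int[] doc : heldOutDocs) {
      logLikelihoodSum += model.logLikelihood(doc);
      tokenCount += doc.length; // N_d, number of tokens in this document
    }
    return Math.exp(-logLikelihoodSum / tokenCount);
  }
}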

bests,
Federico

2011/1/4 Jake Mannix <ja...@gmail.com>

> Saying we have hashing is different from saying we know what will happen to
> an algorithm once it's running over hashed features (as the continuing work
> on our Stochastic SVD demonstrates).
>
> I can certainly try to run LDA over a hashed vector set, but I'm not sure
> what criteria for correctness / quality of the topic model I should use
> if I do.
>
>  -jake
>
> On Jan 4, 2011 7:21 AM, "Robin Anil" <ro...@gmail.com> wrote:
>
> We already have the second part, the hashing trick. Thanks to Ted, and he
> has a mechanism to partially reverse-engineer the features as well. You
> might be able to drop it directly in the job itself, or even vectorize and
> then run LDA.
>
> Robin
>
> On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <ja...@gmail.com> wrote:
> >
> > Hey Robin,
> >
> > Vowp...
>
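
For readers following the hashing discussion above: the trick boils down
to something like the sketch below (plain Java, illustrative only, and
not Mahout's actual encoder API).

// Feature hashing in a nutshell: map each token into a fixed-size index
// space with a hash function, so no dictionary has to be built or shipped.
public class HashedVectorizer {

  /** Returns a dense term-count vector of fixed dimensionality numFeatures. */
  static double[] vectorize(String[] tokens, int numFeatures) {
    double[] vector = new double[numFeatures];
    for (String token : tokens) {
      // floorMod keeps the bucket non-negative even for negative hash codes
      int bucket = Math.floorMod(token.hashCode(), numFeatures);
      vector[bucket] += 1.0; // collisions simply add up; that is the trade-off
    }
    return vector;
  }

  public static void main(String[] args) {
    double[] v = vectorize(new String[] {"latent", "dirichlet", "allocation"}, 1024);
    System.out.println(v.length); // prints 1024
  }
}

One way to partially reverse the hash, as Robin mentions, is to keep a
side map from bucket index to the raw tokens observed hashing there, so
buckets can be reported in human-readable form later.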

Re: Using several Mahout JarSteps in a JobFlow

Posted by Thomas Söhngen <th...@beluto.com>.
Hi Sebastian,

thank you very much, using the tempDir parameter fixed the problem.
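
For the archive, the change was just an extra argument pair per step, so
the two jobs stop colliding on the shared default temp directory
(illustrative paths; an HDFS path works just as well, as long as each
step gets its own):

      "Args": [
         "--input",     "s3n://recommendertest/data/<jobid>/aggregateWatched/",
         "--output",    "s3n://recommendertest/data/<jobid>/userRecommendations/",
         "--tempDir",   "s3n://recommendertest/data/<jobid>/tmp/recommender/",
         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
         "--numRecommendations",    "100"
      ]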

As you mentioned, it would be really nice if there were a single step
which puts out item recommendations for users as well as user-user and
item-item similarity. An alternative would be to split the
RecommenderJob class into separate jobs which rely on each other's
output. This would be even better for my case: I am using AWS EMR, and
if this information is not in the main output of the step, I would have
to do a manual copy out of HDFS, which would be much harder to script.

Best regards,
Thomas

On 08.02.2011 17:46, Sebastian Schelter wrote:
> Hi Thomas,
>
> you can also use the parameter --tempDir to explicitly point a job to a
> temp directory.
>
> By the way, I realize that our users shouldn't need to execute both
> jobs like you do, because the similar-items computation is already
> contained in RecommenderJob. We should add an option that makes it write
> out the similar items in a nice form, so we can avoid having to run both
> jobs.
>
> I'm gonna create a ticket for this.
>
> --sebastian
>
>
> On 08.02.2011 17:37, Sean Owen wrote:
>> I would not run them in the same root directory / key prefix. Put them
>> both under different namespaces.
>>
>> On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen <th...@beluto.com> wrote:
>>> Hi fellow data crunchers,
>>>
>>> I am running a JobFlow with a step using
>>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
>>> following step using
>>> "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
>>> works without problems, but the second one is throwing an Exception:
>>>
>>> Exception in thread "main"
>>>   org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
>>> temp/itemIDIndex already exists and is not empty
>>>         at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
>>>         at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>>         at
>>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at
>>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>>
>>> It looks like the second job is using the same temporary output directories
>>> as the first job. How can I avoid this? Or even better: If some of the tasks
>>> are already done and cached in the first step, how could I use them so that
>>> they don't have to be recomputed in the second step?
>>>
>>> Best regards,
>>> Thomas
>>>
>>> PS: This is the actual JobFlow definition in JSON:
>>>
>>> [
>>>    [......],
>>>   {
>>>     "Name": "MR Step 2: Find similiar items",
>>>     "HadoopJarStep": {
>>>       "MainClass":
>>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
>>>       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>>       "Args": [
>>>          "--input",
>>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>>          "--output",    "s3n://recommendertest/data/<jobid>/similiarItems/",
>>>          "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>>          "--maxSimilaritiesPerItem",    "100"
>>>       ]
>>>     }
>>>   },
>>>   {
>>>     "Name": "MR Step 3: Find items for user",
>>>     "HadoopJarStep": {
>>>       "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
>>>       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>>       "Args": [
>>>          "--input",
>>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>>          "--output",
>>>   "s3n://recommendertest/data/<jobid>/userRecommendations/",
>>>          "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>>          "--numRecommendations",    "100"
>>>       ]
>>>     }
>>>   }
>>> ]
>>>
>>>
>>>

Re: Using several Mahout JarSteps in a JobFlow

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Thomas,

you can also use the parameter --tempDir to explicitly point a job to a
temp directory.

By the way, I realize that our users shouldn't need to execute both
jobs like you do, because the similar-items computation is already
contained in RecommenderJob. We should add an option that makes it write
out the similar items in a nice form, so we can avoid having to run both
jobs.

I'm gonna create a ticket for this.

--sebastian


On 08.02.2011 17:37, Sean Owen wrote:
> I would not run them in the same root directory / key prefix. Put them
> both under different namespaces.
> 
> On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen <th...@beluto.com> wrote:
>> Hi fellow data crunchers,
>>
>> I am running a JobFlow with a step using
>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
>> following step using
>> "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
>> works without problems, but the second one is throwing an Exception:
>>
>> Exception in thread "main"
>>  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
>> temp/itemIDIndex already exists and is not empty
>>        at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
>>        at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
>>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>        at
>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at
>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>>
>> It looks like the second job is using the same temporary output directories
>> as the first job. How can I avoid this? Or even better: If some of the tasks
>> are already done and cached in the first step, how could I use them so that
>> they don't have to be recomputed in the second step?
>>
>> Best regards,
>> Thomas
>>
>> PS: This is the actual JobFlow definition in JSON:
>>
>> [
>>   [......],
>>  {
>>    "Name": "MR Step 2: Find similiar items",
>>    "HadoopJarStep": {
>>      "MainClass":
>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
>>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>      "Args": [
>>         "--input",
>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>         "--output",    "s3n://recommendertest/data/<jobid>/similiarItems/",
>>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>         "--maxSimilaritiesPerItem",    "100"
>>      ]
>>    }
>>  },
>>  {
>>    "Name": "MR Step 3: Find items for user",
>>    "HadoopJarStep": {
>>      "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
>>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>      "Args": [
>>         "--input",
>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>         "--output",
>>  "s3n://recommendertest/data/<jobid>/userRecommendations/",
>>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>         "--numRecommendations",    "100"
>>      ]
>>    }
>>  }
>> ]
>>
>>
>>


Re: Using several Mahout JarSteps in a JobFlow

Posted by Sean Owen <sr...@gmail.com>.
I would not run them in the same root directory / key prefix. Put them
both under different namespaces.

On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen <th...@beluto.com> wrote:
> Hi fellow data crunchers,
>
> I am running a JobFlow with a step using
> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
> following step using
> "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
> works without problems, but the second one is throwing an Exception:
>
> Exception in thread "main"
>  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> temp/itemIDIndex already exists and is not empty
>        at
> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
>        at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>        at
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
>
> It looks like the second job is using the same temporary output directories
> as the first job. How can I avoid this? Or even better: If some of the tasks
> are already done and cached in the first step, how could I use them so that
> they don't have to be recomputed in the second step?
>
> Best regards,
> Thomas
>
> PS: This is the actual JobFlow definition in JSON:
>
> [
>   [......],
>  {
>    "Name": "MR Step 2: Find similiar items",
>    "HadoopJarStep": {
>      "MainClass":
> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>      "Args": [
>         "--input",
> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>         "--output",    "s3n://recommendertest/data/<jobid>/similiarItems/",
>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>         "--maxSimilaritiesPerItem",    "100"
>      ]
>    }
>  },
>  {
>    "Name": "MR Step 3: Find items for user",
>    "HadoopJarStep": {
>      "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>      "Args": [
>         "--input",
> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>         "--output",
>  "s3n://recommendertest/data/<jobid>/userRecommendations/",
>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>         "--numRecommendations",    "100"
>      ]
>    }
>  }
> ]
>
>
>

Using several Mahout JarSteps in a JobFlow

Posted by Thomas Söhngen <th...@beluto.com>.
Hi fellow data crunchers,

I am running a JobFlow with a step using 
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" 
and a following step using 
"org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step 
works without problems, but the second one is throwing an Exception:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/itemIDIndex already exists and is not empty
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
	at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


It looks like the second job is using the same temporary output
directories as the first job. How can I avoid this? Or even better: If 
some of the tasks are already done and cached in the first step, how 
could I use them so that they don't have to be recomputed in the second 
step?

Best regards,
Thomas

PS: This is the actual JobFlow definition in JSON:

[
    [......],
   {
     "Name": "MR Step 2: Find similiar items",
     "HadoopJarStep": {
       "MainClass": 
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
       "Args": [
          "--input",     
"s3n://recommendertest/data/<jobid>/aggregateWatched/",
          "--output",    
"s3n://recommendertest/data/<jobid>/similiarItems/",
          "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
          "--maxSimilaritiesPerItem",    "100"
       ]
     }
   },
   {
     "Name": "MR Step 3: Find items for user",
     "HadoopJarStep": {
       "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
       "Args": [
          "--input",     
"s3n://recommendertest/data/<jobid>/aggregateWatched/",
          "--output",    
"s3n://recommendertest/data/<jobid>/userRecommendations/",
          "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
          "--numRecommendations",    "100"
       ]
     }
   }
]



Re: LDA in Mahout

Posted by Ted Dunning <te...@gmail.com>.
I agree here.  Perplexity is probably the best measure of whether LDA is
still capturing the information it needs.

On Thu, Feb 3, 2011 at 8:58 AM, Federico Castanedo <fc...@inf.uc3m.es> wrote:

> Hi,
>
> I joined this discussion a bit late, but what about the perplexity measure
> reported in section 7.1 of Blei's LDA paper? It seems to be the metric
> commonly used to pick the best value of k (number of topics) when
> training an LDA model.
>
> bests,
> Federico
>
> 2011/1/4 Jake Mannix <ja...@gmail.com>
>
> > Saying we have hashing is different from saying we know what will happen
> > to an algorithm once it's running over hashed features (as the continuing
> > work on our Stochastic SVD demonstrates).
> >
> > I can certainly try to run LDA over a hashed vector set, but I'm not sure
> > what criteria for correctness / quality of the topic model I should use
> > if I do.
> >
> >  -jake
> >
> > On Jan 4, 2011 7:21 AM, "Robin Anil" <ro...@gmail.com> wrote:
> >
> > We already have the second part, the hashing trick. Thanks to Ted, and he
> > has a mechanism to partially reverse-engineer the features as well. You
> > might be able to drop it directly in the job itself, or even vectorize and
> > then run LDA.
> >
> > Robin
> >
> > On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <ja...@gmail.com> wrote:
> > >
> > > Hey Robin,
> > >
> > > Vowp...
> >
>