Posted to user@mahout.apache.org by Thomas Söhngen <th...@beluto.com> on 2011/02/08 17:34:35 UTC

Using several Mahout JarSteps in a JobFlow

Hi fellow data crunchers,

I am running a JobFlow with a step using 
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" 
followed by a step using 
"org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step 
works without problems, but the second one throws an exception:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/itemIDIndex already exists and is not empty
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
	at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

It looks like the second job is using the same temporary output 
directories as the first job. How can I avoid this? Or even better: if 
some of the intermediate results are already computed and cached by the 
first step, how can I reuse them so that they don't have to be 
recomputed in the second step?

Best regards,
Thomas

PS: This is the actual JobFlow definition in JSON:

[
    [......],
    {
      "Name": "MR Step 2: Find similiar items",
      "HadoopJarStep": {
        "MainClass": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
        "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
        "Args": [
           "--input",  "s3n://recommendertest/data/<jobid>/aggregateWatched/",
           "--output", "s3n://recommendertest/data/<jobid>/similiarItems/",
           "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
           "--maxSimilaritiesPerItem", "100"
        ]
      }
    },
    {
      "Name": "MR Step 3: Find items for user",
      "HadoopJarStep": {
        "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
        "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
        "Args": [
           "--input",  "s3n://recommendertest/data/<jobid>/aggregateWatched/",
           "--output", "s3n://recommendertest/data/<jobid>/userRecommendations/",
           "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
           "--numRecommendations", "100"
        ]
      }
    }
]
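As the replies in this thread note, Mahout's Hadoop jobs accept a --tempDir option; giving the RecommenderJob step its own temp directory avoids the collision. A possible variant of the last step definition (the temp path shown is illustrative, not from the original flow):

```json
{
  "Name": "MR Step 3: Find items for user",
  "HadoopJarStep": {
    "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
    "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
    "Args": [
       "--input",  "s3n://recommendertest/data/<jobid>/aggregateWatched/",
       "--output", "s3n://recommendertest/data/<jobid>/userRecommendations/",
       "--tempDir", "temp/recommenderJob",
       "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
       "--numRecommendations", "100"
    ]
  }
}
```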



Re: Using several Mahout JarSteps in a JobFlow

Posted by Thomas Söhngen <th...@beluto.com>.
Hi Sebastian,

thank you very much, using the tempDir parameter fixed the problem.

As you mentioned, it would be really nice if there were a single step 
that outputs item recommendations for users as well as user-user and 
item-item similarities. An alternative would be to split the 
RecommenderJob class into separate jobs that rely on each other's 
output. That would be even better for my case: I am using AWS EMR, and 
if this information is not in the main output of a step, I would have to 
do a manual copy out of HDFS, which is much harder to script.
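Since the job flow is scripted anyway, one way to keep several Mahout steps from colliding is to generate the step definitions with a distinct --tempDir per step. A minimal sketch of that idea; the helper function, the job id, and the bucket layout are made up for illustration:

```python
# Sketch: build HadoopJarStep definitions for an EMR job flow, giving each
# Mahout step its own --tempDir so their intermediate outputs don't collide.
# Bucket names, job id, and paths are illustrative, not from the original flow.

JAR = "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar"

def mahout_step(name, main_class, job_id, step_id, extra_args):
    """Return one HadoopJarStep dict with a per-step temp directory."""
    base = "s3n://recommendertest/data/%s" % job_id
    return {
        "Name": name,
        "HadoopJarStep": {
            "MainClass": main_class,
            "Jar": JAR,
            "Args": [
                "--input",  base + "/aggregateWatched/",
                "--output", base + "/" + step_id + "/",
                # a distinct temp directory per step avoids the
                # "temp/itemIDIndex already exists" collision
                "--tempDir", "temp/" + step_id,
            ] + extra_args,
        },
    }

steps = [
    mahout_step("MR Step 2: Find similar items",
                "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
                "job42", "similarItems",
                ["--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
                 "--maxSimilaritiesPerItem", "100"]),
    mahout_step("MR Step 3: Find items for user",
                "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
                "job42", "userRecommendations",
                ["--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
                 "--numRecommendations", "100"]),
]
```

The resulting list can be serialized to JSON and passed to the EMR job-flow API; each step now writes its intermediates under its own temp/ prefix.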

Best regards,
Thomas

Am 08.02.2011 17:46, schrieb Sebastian Schelter:
> Hi Thomas,
>
> you can also use the parameter --tempDir to explicitly point a job to a
> temp directory.
>
> By the way, I realize that users shouldn't need to execute both jobs
> like you do, because the similar-items computation is already contained
> in RecommenderJob. We should add an option that makes it write out the
> similar items in a usable form, so we can avoid having to run both
> jobs.
>
> I'll create a ticket for this.
>
> --sebastian

Re: Using several Mahout JarSteps in a JobFlow

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Thomas,

you can also use the parameter --tempDir to explicitly point a job to a
temp directory.

By the way, I realize that users shouldn't need to execute both jobs
like you do, because the similar-items computation is already contained
in RecommenderJob. We should add an option that makes it write out the
similar items in a usable form, so we can avoid having to run both
jobs.

I'll create a ticket for this.

--sebastian
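Outside EMR, Sebastian's suggestion amounts to roughly the following invocation (paths illustrative; on EMR the same flag simply goes into the step's Args list):

```shell
hadoop jar mahout-core-0.4-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input data/aggregateWatched/ \
  --output data/userRecommendations/ \
  --tempDir temp/recommenderJob \
  --similarityClassname SIMILARITY_PEARSON_CORRELATION \
  --numRecommendations 100
```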


Am 08.02.2011 17:37, schrieb Sean Owen:
> I would not run them in the same root directory / key prefix. Put them
> both under different namespaces.


Re: Using several Mahout JarSteps in a JobFlow

Posted by Sean Owen <sr...@gmail.com>.
I would not run them in the same root directory / key prefix. Put them
both under different namespaces.
