Posted to user@mahout.apache.org by Tamas Jambor <ja...@googlemail.com> on 2010/05/02 19:48:46 UTC

new to hadoop

hi,

I have just started exploring the distributed version of mahout. I 
wanted to start with running the example job as follows:

  hadoop jar mahout-core-0.3.job 
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input 
testdata/test.txt --output output --tempDir temp --jarFile 
mahout-core-0.3.jar

but I couldn't find a parameter where I can specify the data set the 
recommender will use. I assume this is the reason why the job fails:

10/05/02 18:40:15 INFO mapred.JobClient: Task Id : 
attempt_201004291158_0018_m_000001_2, Status : FAILED
java.lang.ArrayIndexOutOfBoundsException: 1
         at 
org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:42)
         at 
org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
         at org.apache.hadoop.mapred.Child.main(Child.java:170)

thanks

Tamas

Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
I could be cheeky and point you to the book...
http://manning.com/owen

But I can also give you an overview, which is kind of what you see
from surveying the code.

RecommenderJob runs everything. It kicks off 5 different mapreduces, in order.

1. ItemIDIndexMapper / ItemIDIndexReducer
Since item IDs are longs, and vector indices are ints, we have to hash
the longs to ints, but also remember the reverse mapping for later.
That's all this does: write down the mapping.
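As a rough sketch of the idea (not necessarily the exact code):

  // Sketch only: fold a long item ID down to a non-negative int index.
  static int idToIndex(long itemID) {
    return ((int) (itemID ^ (itemID >>> 32))) & Integer.MAX_VALUE;
  }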

2. ToItemPrefsMapper / ToUserVectorReducer
This converts the file of preferences into proper Vectors. Here, there
is one vector per user, and item IDs (hashed) are dimensions and
preference values are dimension values.
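In code terms it's roughly this (a sketch using the Mahout math classes, not the reducer's actual code):

  // Sketch only: one sparse vector per user. RandomAccessSparseVector and Vector
  // are in org.apache.mahout.math; the hashed item index is the dimension and the
  // preference value is the element value. idToIndex is the helper sketched in step 1.
  Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE);
  userVector.set(idToIndex(123L), 4.5);
  userVector.set(idToIndex(456L), 3.0);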

3. UserVectorToCooccurrenceMapper / UserVectorToCooccurrenceReducer
This is a somewhat complex step that does one thing -- counts
co-occurrence. It counts the number of times item A and item B
appear together in one user's preferences.
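The counting is basically a pair loop over each user's nonzero items; summed over all users, that gives the co-occurrence matrix. A toy sketch of the idea (not the actual mapper/reducer code):

  // Toy sketch: every pair of distinct items in one user's vector gets a count of 1.
  // Uses java.util.Map / java.util.HashMap.
  int[] itemsForUser = {17, 42, 99};   // hashed item indices this user has preferences for
  Map<Long, Integer> cooccurrence = new HashMap<Long, Integer>();
  for (int a : itemsForUser) {
    for (int b : itemsForUser) {
      if (a != b) {
        long pair = (((long) a) << 32) | (b & 0xFFFFFFFFL);   // pack (a, b) into one key
        Integer count = cooccurrence.get(pair);
        cooccurrence.put(pair, count == null ? 1 : count + 1);
      }
    }
  }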

4. CooccurrenceColumnWrapperMapper + UserVectorSplitterMapper /
PartialMultiplyReducer
This has two mappers which output one item's cooccurrences (one column
of the co-occurrence matrix), and all user preferences for that item,
in a clever way. The reducer multiplies those preference values by the
co-occurrence column, and outputs the result vectors, keyed by user.
These are part of the final recommendation vector for one user.

5. (IdentityMapper) / AggregateAndRecommendReducer
This adds up the partial vectors to make the final recommendation
vector for each user. The highest values are the recommended items.
The item index is mapped back to item ID and recommendations are
output.
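
In matrix terms, steps 4 and 5 compute, for each user, C * p, where C is the co-occurrence matrix and p is the user's preference vector: step 4 scales one column of C by one preference value, and step 5 adds those partial vectors up and keeps the largest entries. A toy in-memory sketch (the real thing is, of course, spread across the mappers and reducers):

  // Toy sketch only: r = sum over rated items j of C[:, j] * p[j].
  double[][] C = { {0, 2, 1}, {2, 0, 3}, {1, 3, 0} };  // co-occurrence counts from step 3
  double[] p = { 5.0, 0.0, 3.0 };                      // one user's preferences (0 = unrated)
  double[] r = new double[3];
  for (int j = 0; j < 3; j++) {
    if (p[j] != 0.0) {
      for (int i = 0; i < 3; i++) {
        r[i] += C[i][j] * p[j];                        // step 4: column j scaled by p[j]
      }
    }
  }
  // Step 5: sort indices by r[i] descending, drop items the user already has,
  // map the winning indices back to item IDs (from step 1), and output the top few.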


That's it at a very high level, we can discuss more as you look at the code.

Sean

Re: new to hadoop

Posted by Tamas Jambor <ja...@googlemail.com>.
thank you, that solved the problem.

by the way, could you recommend a tutorial/documentation on how it works?

On 04/05/2010 19:32, Sean Owen wrote:
> I see the error I think, and committed a fix. If you can, try it out.
> You can let me know if it's still an issue directly if you care to.
>
> On Tue, May 4, 2010 at 5:10 PM, Sean Owen<sr...@gmail.com>  wrote:
>    
>> I'll look into that. At first glance I'm not sure how it happens. This
>> part only executes if the vector has at least MAX_PREFS_CONSIDERED
>> nonzero elements. Each has a count of at least 1. So the sum must
>> eventually exceed that value. The dirty bit of this computation is
>> that it really needs to look at only counts that actually exist in the
>> map, otherwise it'll iterate through many counts that don't exist.
>> That should be few, but could be the source of an issue.
>>      


Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
I see the error I think, and committed a fix. If you can, try it out.
You can let me know if it's still an issue directly if you care to.

On Tue, May 4, 2010 at 5:10 PM, Sean Owen <sr...@gmail.com> wrote:
> I'll look into that. At first glance I'm not sure how it happens. This
> part only executes if the vector has at least MAX_PREFS_CONSIDERED
> nonzero elements. Each has a count of at least 1. So the sum must
> eventually exceed that value. The dirty bit of this computation is
> that it really needs to look at only counts that actually exist in the
> map, otherwise it'll iterate through many counts that don't exist.
> That should be few, but could be the source of an issue.

Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
I'll look into that. At first glance I'm not sure how it happens. This
part only executes if the vector has at least MAX_PREFS_CONSIDERED
nonzero elements. Each has a count of at least 1. So the sum must
eventually exceed that value. The dirty bit of this computation is
that it really needs to look only at counts that actually exist in the
map; otherwise it'll iterate through many counts that don't exist.
Those should be few, but could be the source of an issue.
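
For anyone following along, the shape of the loop in question is roughly this -- an illustration reconstructed from the description above, not the actual source, and numElementsWithCount is a hypothetical helper:

  // Illustration only: raise a count cutoff until at least MAX_PREFS_CONSIDERED
  // elements survive. If resultingSizeAtCutoff never increases -- e.g. because the
  // counts being looked up are not the ones actually present in the map -- the
  // loop never terminates, which would explain the hang.
  int resultingSizeAtCutoff = 0;
  int cutoff = 1;
  while (resultingSizeAtCutoff < MAX_PREFS_CONSIDERED) {
    resultingSizeAtCutoff += numElementsWithCount(cutoff);  // hypothetical helper
    cutoff++;
  }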

On Tue, May 4, 2010 at 4:23 PM, Tamas Jambor <ja...@googlemail.com> wrote:
> thanks. I have identified the infinite loop. It is in
>
> org.apache.mahout.cf.taste.hadoop.item.UserVectorToCooccurrenceMapper.maybePruneUserVector(UserVectorToCooccurrenceMapper.java:88)
>
> where the resultingSizeAtCutoff variable remains zero, it does not increase.
>
> Tamas
>
> On 04/05/2010 15:35, Vimal Mathew wrote:
>>
>> "kill -QUIT" will cause the stack trace to be dumped to stderr (which
>> is usually a log file). You can also try
>>
>> jstack [java process ID]
>>
>> to read the stack trace directly.
>>
>> You can use  the "jps" command to list Java processes running on a system.
>>
>>
>>
>> On Mon, May 3, 2010 at 7:26 PM, Sean Owen<sr...@gmail.com>  wrote:
>>
>>>
>>> I think the infinite loop theory is good.
>>>
>>> As a crude way to debug, you can log on to a worker machine, locate
>>> the java process that may be stuck, and:
>>>
>>> kill -QUIT [java process ID]
>>>
>>> This just makes it dump its stack for each thread. Do that a few times
>>> and you may easily spot an infinite loop situation because it will
>>> just be in the same place over and over.
>>>
>>> http://java.sun.com/developer/technicalArticles/Programming/Stacktrace/
>>>
>>> On Tue, May 4, 2010 at 12:15 AM, Tamas Jambor<ja...@googlemail.com>
>>>  wrote:
>>>
>>>>
>>>> It should be OK, because the hosts are in a local network, properly set
>>>> up
>>>> by the IT support.
>>>>
>>>> I guess the conf files should be OK too, because it runs the first two
>>>> jobs
>>>> without a problem only fails with the third. and it runs other hadoop
>>>> examples.
>>>>
>>>> I will look into how to debug a hadoop project, maybe I can trace down
>>>> the
>>>> problem that way.
>>>>
>>>>
>>>
>>>
>

Re: new to hadoop

Posted by Tamas Jambor <ja...@googlemail.com>.
thanks. I have identified the infinite loop. It is in

org.apache.mahout.cf.taste.hadoop.item.UserVectorToCooccurrenceMapper.maybePruneUserVector(UserVectorToCooccurrenceMapper.java:88)

where the resultingSizeAtCutoff variable remains zero; it does not increase.

Tamas

On 04/05/2010 15:35, Vimal Mathew wrote:
> "kill -QUIT" will cause the stack trace to be dumped to stderr (which
> is usually a log file). You can also try
>
> jstack [java process ID]
>
> to read the stack trace directly.
>
> You can use  the "jps" command to list Java processes running on a system.
>
>
>
> On Mon, May 3, 2010 at 7:26 PM, Sean Owen<sr...@gmail.com>  wrote:
>    
>> I think the infinite loop theory is good.
>>
>> As a crude way to debug, you can log on to a worker machine, locate
>> the java process that may be stuck, and:
>>
>> kill -QUIT [java process ID]
>>
>> This just makes it dump its stack for each thread. Do that a few times
>> and you may easily spot an infinite loop situation because it will
>> just be in the same place over and over.
>>
>> http://java.sun.com/developer/technicalArticles/Programming/Stacktrace/
>>
>> On Tue, May 4, 2010 at 12:15 AM, Tamas Jambor<ja...@googlemail.com>  wrote:
>>      
>>> It should be OK, because the hosts are in a local network, properly set up
>>> by the IT support.
>>>
>>> I guess the conf files should be OK too, because it runs the first two jobs
>>> without a problem only fails with the third. and it runs other hadoop
>>> examples.
>>>
>>> I will look into how to debug a hadoop project, maybe I can trace down the
>>> problem that way.
>>>
>>>        
>>      

Re: new to hadoop

Posted by Vimal Mathew <vm...@gmail.com>.
"kill -QUIT" will cause the stack trace to be dumped to stderr (which
is usually a log file). You can also try

jstack [java process ID]

to read the stack trace directly.

You can use  the "jps" command to list Java processes running on a system.



On Mon, May 3, 2010 at 7:26 PM, Sean Owen <sr...@gmail.com> wrote:
> I think the infinite loop theory is good.
>
> As a crude way to debug, you can log on to a worker machine, locate
> the java process that may be stuck, and:
>
> kill -QUIT [java process ID]
>
> This just makes it dump its stack for each thread. Do that a few times
> and you may easily spot an infinite loop situation because it will
> just be in the same place over and over.
>
> http://java.sun.com/developer/technicalArticles/Programming/Stacktrace/
>
> On Tue, May 4, 2010 at 12:15 AM, Tamas Jambor <ja...@googlemail.com> wrote:
>> It should be OK, because the hosts are in a local network, properly set up
>> by the IT support.
>>
>> I guess the conf files should be OK too, because it runs the first two jobs
>> without a problem only fails with the third. and it runs other hadoop
>> examples.
>>
>> I will look into how to debug a hadoop project, maybe I can trace down the
>> problem that way.
>>
>

Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
I think the infinite loop theory is good.

As a crude way to debug, you can log on to a worker machine, locate
the java process that may be stuck, and:

kill -QUIT [java process ID]

This just makes it dump its stack for each thread. Do that a few times
and you may easily spot an infinite loop situation because it will
just be in the same place over and over.

http://java.sun.com/developer/technicalArticles/Programming/Stacktrace/

On Tue, May 4, 2010 at 12:15 AM, Tamas Jambor <ja...@googlemail.com> wrote:
> It should be OK, because the hosts are in a local network, properly set up
> by the IT support.
>
> I guess the conf files should be OK too, because it runs the first two jobs
> without a problem only fails with the third. and it runs other hadoop
> examples.
>
> I will look into how to debug a hadoop project, maybe I can trace down the
> problem that way.
>

Re: new to hadoop

Posted by Tamas Jambor <ja...@googlemail.com>.
It should be OK, because the hosts are on a local network, properly set
up by the IT support.

I guess the conf files should be OK too, because it runs the first two
jobs without a problem and only fails on the third, and it runs other
Hadoop examples.

I will look into how to debug a hadoop project, maybe I can trace down 
the problem that way.

Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
Well, I can tell you about a time I saw something like this, but I am
not sure it is related to your situation.

I had serious problems with tasks being unable to find each other
shortly after startup, like when they first tried to report back to
the namenode. In the end I discovered it was because my mobile
broadband connection would sometimes change the hostname of my laptop!
And so jobs could no longer resolve the hostname of other jobs.

I doubt that's the cause, but, something similar could be happening.
Have you set up the conf/*.xml files in the usual way, per the Hadoop
instructions, so you are definitely telling it that localhost is where to
find HDFS, etc.?
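
For reference, the usual setup has something along these lines in conf/core-site.xml (host and port adjusted to your cluster; the values here are just the single-node defaults from the tutorials):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>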

Re: new to hadoop

Posted by Tamas Jambor <ja...@googlemail.com>.
I can't see anything unusual. These are the logs for
attempt_201005032003_0006_m_000000_2:

-----------------------------------------------------------------------------------------------------------------------

2010-05-03 23:05:48,590 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201005032003_0006_m_000000_2 task's state:UNASSIGNED
2010-05-03 23:05:48,590 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201005032003_0006_m_000000_2
2010-05-03 23:05:48,590 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 7 and trying to launch attempt_201005032003_0006_m_000000_2
2010-05-03 23:05:49,473 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201005032003_0006_m_1086521438
2010-05-03 23:05:49,473 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201005032003_0006_m_1086521438 spawned.
2010-05-03 23:05:50,248 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201005032003_0006_m_1086521438 given task: attempt_201005032003_0006_m_000000_2
2010-05-03 23:05:56,784 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201005032003_0006_m_000000_2 0.0% hdfs://bunwell.cs.ucl.ac.uk:54310/user/tjambor/temp/userVectors/part-00000:0+6105463
2010-05-03 23:25:59,246 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201005032003_0006_m_000000_2: Task attempt_201005032003_0006_m_000000_2 failed to report status for 1202 seconds. Killing!
2010-05-03 23:25:59,261 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task

-----------------------------------------------------------------------------------------

2010-05-03 23:05:48,966 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /128.16.1.46:50010, dest: /128.16.1.46:38302, bytes: 10619777, op: HDFS_READ, cliID: DFSClient_-1485123568, srvID: DS-826409173-128.16.1.46-50010-1272538619979, blockid: blk_7183689076291458667_1382
2010-05-03 23:06:46,099 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /128.16.1.46:50010, dest: /128.16.1.45:39357, bytes: 156864, op: HDFS_READ, cliID: DFSClient_attempt_201005032003_0006_m_000000_1, srvID: DS-826409173-128.16.1.46-50010-1272538619979, blockid: blk_6035162353377502411_1381
2010-05-03 23:11:34,183 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_1615080304667462031_1375
2010-05-03 23:16:52,824 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 8 blocks got processed in 4 msecs
2010-05-03 23:25:59,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /128.16.1.46:50010, dest: /128.16.1.46:38307, bytes: 185760, op: HDFS_READ, cliID: DFSClient_attempt_201005032003_0006_m_000000_2, srvID: DS-826409173-128.16.1.46-50010-1272538619979, blockid: blk_6035162353377502411_1381
2010-05-03 23:26:59,811 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /128.16.1.46:50010, dest: /128.16.2.130:35911, bytes: 156864, op: HDFS_READ, cliID: DFSClient_attempt_201005032003_0006_m_000000_3, srvID: DS-826409173-128.16.1.46-50010-1272538619979, blockid: blk_6035162353377502411_1381
2010-05-03 23:27:08,722 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_5895961339651429890_1386 src: /128.16.1.45:40451 dest: /128.16.1.46:50010
2010-05-03 23:27:08,727 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /128.16.1.45:40451, dest: /128.16.1.46:50010, bytes: 7585, op: HDFS_WRITE, cliID: DFSClient_-1971088822, srvID: DS-826409173-128.16.1.46-50010-1272538619979, blockid: blk_5895961339651429890_1386
2010-05-03 23:27:08,727 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_5895961339651429890_1386 terminating
2010-05-03 23:27:17,245 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_7183689076291458667_1382 file /usr/local/hadoop-datastore/hadoop-tjambor/dfs/data/current/blk_7183689076291458667
2010-05-03 23:31:38,622 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_1342409871477974276_1375
2010-05-03 23:33:05,378 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_6035162353377502411_1381

---------------------------------------------------------------------------------------

10/05/03 22:45:09 INFO mapred.JobClient: Running job: job_201005032003_0006
10/05/03 22:45:10 INFO mapred.JobClient:  map 0% reduce 0%
10/05/03 22:46:01 INFO mapred.JobClient:  map 1% reduce 0%
10/05/03 22:46:04 INFO mapred.JobClient:  map 2% reduce 0%
10/05/03 22:46:34 INFO mapred.JobClient:  map 9% reduce 0%
10/05/03 22:46:37 INFO mapred.JobClient:  map 28% reduce 0%
10/05/03 22:46:40 INFO mapred.JobClient:  map 50% reduce 0%
10/05/03 22:49:14 INFO mapred.JobClient:  map 50% reduce 16%
10/05/03 23:05:48 INFO mapred.JobClient: Task Id : attempt_201005032003_0006_m_000000_0, Status : FAILED
Task attempt_201005032003_0006_m_000000_0 failed to report status for 1200 seconds. Killing!
10/05/03 23:06:53 INFO mapred.JobClient: Task Id : attempt_201005032003_0006_m_000000_1, Status : FAILED
Task attempt_201005032003_0006_m_000000_1 failed to report status for 1201 seconds. Killing!
10/05/03 23:26:06 INFO mapred.JobClient: Task Id : attempt_201005032003_0006_m_000000_2, Status : FAILED
Task attempt_201005032003_0006_m_000000_2 failed to report status for 1202 seconds. Killing!
10/05/03 23:27:09 INFO mapred.JobClient: Job complete: job_201005032003_0006
10/05/03 23:27:09 INFO mapred.JobClient: Counters: 15
10/05/03 23:27:09 INFO mapred.JobClient:   Job Counters
10/05/03 23:27:09 INFO mapred.JobClient:     Launched reduce tasks=1
10/05/03 23:27:09 INFO mapred.JobClient:     Rack-local map tasks=2
10/05/03 23:27:09 INFO mapred.JobClient:     Launched map tasks=6
10/05/03 23:27:09 INFO mapred.JobClient:     Data-local map tasks=4
10/05/03 23:27:09 INFO mapred.JobClient:     Failed map tasks=1
10/05/03 23:27:09 INFO mapred.JobClient:   FileSystemCounters
10/05/03 23:27:09 INFO mapred.JobClient:     FILE_BYTES_READ=9415218
10/05/03 23:27:09 INFO mapred.JobClient:     HDFS_BYTES_READ=6105577
10/05/03 23:27:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=18151812
10/05/03 23:27:09 INFO mapred.JobClient:   Map-Reduce Framework
10/05/03 23:27:09 INFO mapred.JobClient:     Combine output records=3338383
10/05/03 23:27:09 INFO mapred.JobClient:     Map input records=2971
10/05/03 23:27:09 INFO mapred.JobClient:     Spilled Records=6676766
10/05/03 23:27:09 INFO mapred.JobClient:     Map output bytes=58177620
10/05/03 23:27:09 INFO mapred.JobClient:     Map input bytes=6104511
10/05/03 23:27:09 INFO mapred.JobClient:     Combine input records=4848135
10/05/03 23:27:09 INFO mapred.JobClient:     Map output records=4848135
Exception in thread "main" java.io.IOException: Job failed!
         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
         at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:132)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
         at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:185)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)




Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
It could be something strange like that. You should still be able to
access the task tracker's logs to see output from the jobs, which
could indicate the issue -- are you looking at those logs rather than
the namenode?

On Mon, May 3, 2010 at 8:32 PM, Tamas Jambor <ja...@googlemail.com> wrote:
> I still couldn't figure out the problem; it gets stuck at the same point. there
> are no errors in the log file.
> maybe it gets into an infinite loop?
>

Re: new to hadoop

Posted by Tamas Jambor <ja...@googlemail.com>.
I still couldn't figure out the problem; it gets stuck at the same point.
there are no errors in the log file.
maybe it gets into an infinite loop?

On 03/05/2010 17:23, Sean Owen wrote:
> This indicates a problem with your cluster, it seems. The task worker
> became unreachable or died. You would have to dig through the Hadoop
> logs to get more information.
>    


Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
This indicates a problem with your cluster, it seems. The task worker
became unreachable or died. You would have to dig through the Hadoop
logs to get more information.

How many mapper/reducer tasks are you requesting? If you want to use
more than the default of 1, in Hadoop, you need to set
"-Dmapred.map.tasks=X" and "-Dmapred.reduce.tasks=Y".

Re: new to hadoop

Posted by Tamas Jambor <ja...@googlemail.com>.
thanks. one step closer. I needed to assign -Xmx2048m to make it work.

now I get the following error:

Task attempt_201005031452_0003_m_000001_0 failed to report status for 
600 seconds. Killing!

and then it reassigns the task to other nodes, but they all fail the same way.

This is with the third task 
(RecommenderJob-UserVectorToCooccurrenceMapper-UserVectorToCooccurrenceReducer)

by the way, I don't understand why the mapper assigns the task to only 2
nodes; when I run the sample mapreduce word count example, it uses all
the nodes available.

Tamas


On 03/05/2010 11:51, Sean Owen wrote:
> Not sure I understand the question -- all jobs need to run for the
> recommendations to complete. It is a process with about 5 distinct
> mapreduces. Which one fails with an OOME? they have names, you can see
> in the console.
>
> Are you giving Hadoop workers enough memory? by default they can only
> use like 64MB which is far too little. You need to, for example, in
> conf/mapred-site.xml, add a new property named
> “mapred.child.java.opts” with value “-Xmx1024m” to give workers up to
> 1GB of heap. They probably don't need that much but might as well not
> limit it.
>    


Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
Not sure I understand the question -- all jobs need to run for the
recommendations to complete. It is a process with about 5 distinct
mapreduces. Which one fails with an OOME? They have names; you can see
them in the console.

Are you giving Hadoop workers enough memory? by default they can only
use like 64MB which is far too little. You need to, for example, in
conf/mapred-site.xml, add a new property named
“mapred.child.java.opts” with value “-Xmx1024m” to give workers up to
1GB of heap. They probably don't need that much but might as well not
limit it.
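
For concreteness, the property block would look like this:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>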

Re: new to hadoop

Posted by Tamas Jambor <ja...@googlemail.com>.
thanks a lot. it works with the latest version. however, it runs out of 
memory after the 2nd job. what exactly are these jobs? and how many 
should run to finish the task?
(hadoop jar 
/localhome/tjambor/mahout/core/target/mahout-core-0.4-SNAPSHOT.job 
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob 
-Dmapred.input.dir=testdata/100k_data.data -Dmapred.output.dir=output)

I have a small cluster, only 5 computers, and some of them do not have a
lot of memory, but the master server should be fine. Besides, I only ran
the test with the smallest MovieLens data set (100k).


On 02/05/2010 20:08, Sean Owen wrote:
> It seems it cannot find Vector in the .job file, and I am not sure why
> that would be. You build from source and used the .job file in
> core/target? Should be OK.
>
> Nevertheless I'd suggest you use the latest from SVN instead; if there
> is some issue with 0.3 unfortunately it's hard to help since the code
> has moved on so much since then. It's easy to support what's at head
> now.
>    

Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
It seems it cannot find Vector in the .job file, and I am not sure why
that would be. You built from source and used the .job file in
core/target? Should be OK.

Nevertheless I'd suggest you use the latest from SVN instead; if there
is some issue with 0.3 unfortunately it's hard to help since the code
has moved on so much since then. It's easy to support what's at head
now.

Re: new to hadoop

Posted by Tamas Jambor <ja...@googlemail.com>.
thanks. I think I put the data there in the wrong format. It works now up
to a point but for some reason it fails again:

[tjambor@bunwell ~]$ hadoop jar 
/localhome/tjambor/mahout/mahout-core-0.3.job 
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input 
testdata/100k_data.data --output output -t temp --jarFile 
/localhome/tjambor/mahout/mahout-core-0.3.jar
10/05/02 19:22:32 WARN mapred.JobClient: Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.
10/05/02 19:22:33 INFO mapred.FileInputFormat: Total input paths to 
process : 1
10/05/02 19:22:33 INFO mapred.JobClient: Running job: job_201004291158_0024
10/05/02 19:22:34 INFO mapred.JobClient:  map 0% reduce 0%
10/05/02 19:22:44 INFO mapred.JobClient:  map 50% reduce 0%
10/05/02 19:22:45 INFO mapred.JobClient:  map 100% reduce 0%
10/05/02 19:22:56 INFO mapred.JobClient:  map 100% reduce 100%
10/05/02 19:22:58 INFO mapred.JobClient: Job complete: job_201004291158_0024
10/05/02 19:22:58 INFO mapred.JobClient: Counters: 19
10/05/02 19:22:58 INFO mapred.JobClient:   Job Counters
10/05/02 19:22:58 INFO mapred.JobClient:     Launched reduce tasks=1
10/05/02 19:22:58 INFO mapred.JobClient:     Rack-local map tasks=1
10/05/02 19:22:58 INFO mapred.JobClient:     Launched map tasks=2
10/05/02 19:22:58 INFO mapred.JobClient:     Data-local map tasks=1
10/05/02 19:22:58 INFO mapred.JobClient:   FileSystemCounters
10/05/02 19:22:58 INFO mapred.JobClient:     FILE_BYTES_READ=1400006
10/05/02 19:22:58 INFO mapred.JobClient:     HDFS_BYTES_READ=981108
10/05/02 19:22:58 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2800082
10/05/02 19:22:58 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=42610
10/05/02 19:22:58 INFO mapred.JobClient:   Map-Reduce Framework
10/05/02 19:22:58 INFO mapred.JobClient:     Reduce input groups=1682
10/05/02 19:22:58 INFO mapred.JobClient:     Combine output records=0
10/05/02 19:22:58 INFO mapred.JobClient:     Map input records=100000
10/05/02 19:22:58 INFO mapred.JobClient:     Reduce shuffle bytes=691382
10/05/02 19:22:58 INFO mapred.JobClient:     Reduce output records=1682
10/05/02 19:22:58 INFO mapred.JobClient:     Spilled Records=200000
10/05/02 19:22:58 INFO mapred.JobClient:     Map output bytes=1200000
10/05/02 19:22:58 INFO mapred.JobClient:     Map input bytes=979173
10/05/02 19:22:58 INFO mapred.JobClient:     Combine input records=0
10/05/02 19:22:58 INFO mapred.JobClient:     Map output records=100000
10/05/02 19:22:58 INFO mapred.JobClient:     Reduce input records=100000
10/05/02 19:22:58 WARN mapred.JobClient: Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.
10/05/02 19:22:58 INFO mapred.FileInputFormat: Total input paths to 
process : 1
10/05/02 19:22:58 INFO mapred.JobClient: Running job: job_201004291158_0025
10/05/02 19:22:59 INFO mapred.JobClient:  map 0% reduce 0%
10/05/02 19:23:09 INFO mapred.JobClient:  map 100% reduce 0%
10/05/02 19:23:20 INFO mapred.JobClient: Task Id : 
attempt_201004291158_0025_r_000000_0, Status : FAILED
Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
         at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
         at java.security.AccessController.doPrivileged(Native Method)
         at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
         at java.lang.Class.forName0(Native Method)
         at java.lang.Class.forName(Class.java:247)
         at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
         at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:807)
         at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833)
         at 
org.apache.hadoop.mapred.JobConf.getReducerClass(JobConf.java:832)
         at 
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:426)
         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
         at org.apache.hadoop.mapred.Child.main(Child.java:170)


On 02/05/2010 19:02, Sean Owen wrote:
> (PS you should really try using the latest code from Subversion --
> it's changed a little bit in the arguments, but is much more efficient
> and effective. The javadoc explains the new usage.)
>
> On Sun, May 2, 2010 at 7:01 PM, Sean Owen<sr...@gmail.com>  wrote:
>    
>> --input specifies the data to use, and you have done so. It sounds
>> like it's empty or not in the right format. What is in
>> testdata/test.txt?
>>      

Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
(PS you should really try using the latest code from Subversion --
it's changed a little bit in the arguments, but is much more efficient
and effective. The javadoc explains the new usage.)

On Sun, May 2, 2010 at 7:01 PM, Sean Owen <sr...@gmail.com> wrote:
> --input specifies the data to use, and you have done so. It sounds
> like it's empty or not in the right format. What is in
> testdata/test.txt?

Re: new to hadoop

Posted by Sean Owen <sr...@gmail.com>.
--input specifies the data to use, and you have done so. It sounds
like it's empty or not in the right format. What is in
testdata/test.txt?
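
If I remember the parser right, it expects plain text with one preference per line, comma-delimited as userID,itemID,preference -- for example:

  1,101,5.0
  1,102,3.0
  2,101,2.5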

On Sun, May 2, 2010 at 6:48 PM, Tamas Jambor <ja...@googlemail.com> wrote:
> hi,
>
> I have just started exploring the distributed version of mahout. I wanted to
> start with running the example job as follows:
>
>  hadoop jar mahout-core-0.3.job
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input
> testdata/test.txt --output output --tempDir temp --jarFile
> mahout-core-0.3.jar
>
> but I couldn't find a parameter where I can specify the data set the
> recommender will use. I assume this is the reason why the job fails:
>
> 10/05/02 18:40:15 INFO mapred.JobClient: Task Id :
> attempt_201004291158_0018_m_000001_2, Status : FAILED
> java.lang.ArrayIndexOutOfBoundsException: 1
>        at
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:42)
>        at
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> thanks
>
> Tamas
>