Posted to dev@mahout.apache.org by "Ikumasa Mukai (Issue Comment Edited) (JIRA)" <ji...@apache.org> on 2011/12/20 02:57:31 UTC

[jira] [Issue Comment Edited] (MAHOUT-932) RandomForest quits with ArrayIndexOutOfBoundsException while running sample

    [ https://issues.apache.org/jira/browse/MAHOUT-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172839#comment-13172839 ] 

Ikumasa Mukai edited comment on MAHOUT-932 at 12/20/11 1:57 AM:
----------------------------------------------------------------

Maybe this problem is caused by the following process (a sketch of the arithmetic follows the block).
{noformat}
1) The number of trees for our forest is 100 (-t 100).
2) Hadoop creates 10 mappers.
   This value is calculated as 18742306 / 1874231 = 10 (the size of the data / -Dmapred.max.split.size).
3) Each mapper calculates how many trees it should build. This number should be 10 trees per mapper.
4) For this calculation, the value of mapred.map.tasks, which is provided by Hadoop, is used.
   With v0.20.2, mapred.map.tasks is 10, so 100 / 10 = 10 trees are built per mapper.
   With v0.20.204, mapred.map.tasks is 1, so 100 / 1 = 100 trees are built per mapper.
5) After building the trees, we gather the trees from all mappers to assemble the forest.
   In this phase we hit the exception with v0.20.204.
6) This is because the total number of trees we gather is 1000 (100 trees x 10 mappers),
   but the array prepared for gathering them (TreeID) has size 100.
{noformat}
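
To make the arithmetic concrete, here is a minimal, self-contained sketch of the failure mode. The names (requestedTrees, reportedMapTasks, actualMappers) are mine for illustration, not Mahout's actual identifiers:
{noformat}
// Hypothetical sketch of the per-mapper tree count going wrong when
// mapred.map.tasks does not match the real number of mappers.
public class TreeCountSketch {
  public static void main(String[] args) {
    int requestedTrees = 100; // -t 100, also the size of the TreeID array
    int actualMappers = 10;   // 18742306 / 1874231 = 10 input splits

    // v0.20.2 reports mapred.map.tasks = 10; v0.20.204 reports 1.
    for (int reportedMapTasks : new int[] {10, 1}) {
      int treesPerMapper = requestedTrees / reportedMapTasks;
      int gatheredTrees = treesPerMapper * actualMappers;
      System.out.printf("mapred.map.tasks=%d -> %d trees/mapper, %d gathered (array size %d)%n",
          reportedMapTasks, treesPerMapper, gatheredTrees, requestedTrees);
    }
    // With reportedMapTasks = 1 we gather 1000 trees into an array of
    // length 100, so processOutput() throws ArrayIndexOutOfBoundsException: 100.
  }
}
{noformat}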
To fix this, we could change the parameter used for the calculation in step 4), but I don't think that is a good approach (a possible alternative is sketched below).
What do you think?
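
For illustration only, a hedged sketch of one alternative: derive the mapper count from the input size and the configured split size instead of trusting mapred.map.tasks. The class and helper names here (SplitBasedTreeCount, estimateMappers, treesPerMapper) are hypothetical, not Mahout's API:
{noformat}
// Hypothetical alternative: compute the mapper count from the actual
// split arithmetic rather than reading mapred.map.tasks.
public class SplitBasedTreeCount {
  static int estimateMappers(long inputBytes, long maxSplitSize) {
    // Ceiling division: 18742306 / 1874231 rounds up to 10 mappers.
    return (int) ((inputBytes + maxSplitSize - 1) / maxSplitSize);
  }

  static int treesPerMapper(int requestedTrees, int mappers) {
    // 100 trees / 10 mappers = 10 trees per mapper, keeping the total at
    // the 100 trees that the gathering array (TreeID) expects.
    return requestedTrees / mappers;
  }

  public static void main(String[] args) {
    int mappers = estimateMappers(18742306L, 1874231L);
    System.out.println(mappers + " mappers, " + treesPerMapper(100, mappers) + " trees each");
  }
}
{noformat}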

Regards,
                
> RandomForest quits with ArrayIndexOutOfBoundsException while running sample
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-932
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-932
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.6
>         Environment: Mac OS X, current Mac OS shipped Java version, latest checkout from 17.12.2011
> Dual Core MacBook Pro 2009, 8 GB, SSD
>            Reporter: Berttenfall M.
>            Priority: Minor
>              Labels: Classifier, DecisionForest, RandomForest
>
> Hello,
> when running the example at https://cwiki.apache.org/MAHOUT/partial-implementation.html with the recommended data sets, several issues occur.
> First: ARFF files no longer seem to be supported; I've been using the UCI format as recommended here (https://cwiki.apache.org/MAHOUT/breiman-example.html). With ARFF files, Mahout quits while creating the description file (wrong number of attributes in the string); with the UCI format it works.
> The main error happens during the BuildForest step (I could not test TestForest, as no forest was built).
> Running:
> $MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d convertedData/data.data -ds KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest.
> I tested different split-size values: 1874231, 187423, and 18742 give the following error; 1874 does not finish on my machine (Dual Core MacBook Pro 2009, 8 GB, SSD).
> It quits after a while (when the map phase is almost done) with the following message:
> 11/12/17 16:23:24 INFO mapred.Task: Task 'attempt_local_0001_m_000998_0' done.
> 11/12/17 16:23:24 INFO mapred.Task: Task:attempt_local_0001_m_000999_0 is done. And is in the process of commiting
> 11/12/17 16:23:24 INFO mapred.LocalJobRunner: 
> 11/12/17 16:23:24 INFO mapred.Task: Task attempt_local_0001_m_000999_0 is allowed to commit now
> 11/12/17 16:23:24 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000999_0' to file:/Users/martin/Documents/Studium/Master/LargeScaleProcessing/Repository/mahout_algorithms_evaluation/testingRandomForests/nsl-forest
> 11/12/17 16:23:27 INFO mapred.LocalJobRunner: 
> 11/12/17 16:23:27 INFO mapred.Task: Task 'attempt_local_0001_m_000999_0' done.
> 11/12/17 16:23:28 INFO mapred.JobClient:  map 100% reduce 0%
> 11/12/17 16:23:28 INFO mapred.JobClient: Job complete: job_local_0001
> 11/12/17 16:23:28 INFO mapred.JobClient: Counters: 8
> 11/12/17 16:23:28 INFO mapred.JobClient:   File Output Format Counters 
> 11/12/17 16:23:28 INFO mapred.JobClient:     Bytes Written=41869032
> 11/12/17 16:23:28 INFO mapred.JobClient:   FileSystemCounters
> 11/12/17 16:23:28 INFO mapred.JobClient:     FILE_BYTES_READ=37443033225
> 11/12/17 16:23:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44946910704
> 11/12/17 16:23:28 INFO mapred.JobClient:   File Input Format Counters 
> 11/12/17 16:23:28 INFO mapred.JobClient:     Bytes Read=20478569
> 11/12/17 16:23:28 INFO mapred.JobClient:   Map-Reduce Framework
> 11/12/17 16:23:28 INFO mapred.JobClient:     Map input records=125973
> 11/12/17 16:23:28 INFO mapred.JobClient:     Spilled Records=0
> 11/12/17 16:23:28 INFO mapred.JobClient:     Map output records=100000
> 11/12/17 16:23:28 INFO mapred.JobClient:     SPLIT_RAW_BYTES=215000
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 100
> 	at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.processOutput(PartialBuilder.java:126)
> 	at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.parseOutput(PartialBuilder.java:89)
> 	at org.apache.mahout.classifier.df.mapreduce.Builder.build(Builder.java:303)
> 	at org.apache.mahout.classifier.df.mapreduce.BuildForest.buildForest(BuildForest.java:201)
> 	at org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:163)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> 	at org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:225)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> 	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> PS: I adjusted the class name to .classifier.df. and removed -oop

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira