Posted to dev@mahout.apache.org by "Nabarun Sengupta (JIRA)" <ji...@apache.org> on 2012/06/04 04:15:23 UTC

[jira] [Commented] (MAHOUT-1021) Blank csv input file given to Canopy/Kmeans clustering

    [ https://issues.apache.org/jira/browse/MAHOUT-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288313#comment-13288313 ] 

Nabarun Sengupta commented on MAHOUT-1021:
------------------------------------------

Any updates on this issue?
                
> Blank csv input file given to Canopy/Kmeans clustering
> ------------------------------------------------------
>
>                 Key: MAHOUT-1021
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1021
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.6
>         Environment: Mahout 0.6 version on hadoop 0.2, Testing on HadooponAzure platform
>            Reporter: Nabarun Sengupta
>            Assignee: Suneel Marthi
>            Priority: Minor
>             Fix For: Backlog
>
>
> Hi,
> This is regarding a bug we observed in Canopy clustering; we could reproduce the same behavior in Kmeans. Given a blank csv input file, the algorithm runs two jobs successfully and only throws an error during the third job. When I ran a malformed csv file containing stray decimals or characters, the error was raised during the first job itself. The same validation should therefore be applied to a blank input file, so that an exception is thrown during the first job execution.
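The validation proposed above could look something like the following sketch. This is a hypothetical helper, not actual Mahout code, and it uses plain java.nio as a stand-in for the HDFS FileSystem API: the idea is simply to fail fast on an empty input file before the first job is submitted, instead of letting the third job die with "Canopies are empty!".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-flight check: reject empty input before any job runs. */
public final class InputValidation {

  private InputValidation() {}

  // Throws IllegalStateException if the input file is missing or contains
  // no data, mirroring the exception the third job currently raises.
  public static void checkNotEmpty(Path input) throws IOException {
    if (!Files.exists(input)) {
      throw new IllegalStateException("Input " + input + " does not exist");
    }
    if (Files.size(input) == 0) {
      throw new IllegalStateException("Input " + input + " is empty; nothing to cluster");
    }
  }
}
```

A driver would call this once on the csv path before launching the first MapReduce job, so a blank file fails immediately, just as a malformed file does today.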
> Following is the job execution details:
> c:\Apps\dist\mahout\examples\bin>build-cluster-syntheticcontrol.cmd
> "Please select a number to choose the corresponding clustering algorithm"
> "1. canopy clustering"
> "2. kmeans clustering"
> "3. fuzzykmeans clustering"
> "4. dirichlet clustering"
> "5. meanshift clustering"
> Enter your choice:1
> "ok. You chose 1 and we'll use canopy Clustering"
> "HDFS is healthy... "
> "Uploading Synthetic control data to HDFS"
> Deleted hdfs://10.114.251.23:9000/user/milind/testdata
> "Successfully Uploaded Synthetic control data to HDFS "
> "Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
> c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver org.apache.mahout.clustering.syntheticcontrol.canopy.Job
> 05/17 10:46:11 WARN driver.MahoutDriver: No org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on classpath, will use command-line arguments only
> 05/17 10:46:11 INFO canopy.Job: Running with default arguments
> 05/17 10:46:12 INFO common.HadoopUtil: Deleting output
> 05/17 10:46:12 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool
> 05/17 10:46:13 INFO input.FileInputFormat: Total input paths to process : 1
> 05/17 10:46:14 INFO mapred.JobClient: Running job: job_201205170655_0017
> 05/17 10:46:15 INFO mapred.JobClient:  map 0% reduce 0%
> 05/17 10:46:48 INFO mapred.JobClient:  map 100% reduce 0%
> 05/17 10:46:59 INFO mapred.JobClient: Job complete: job_201205170655_0017
> 05/17 10:46:59 INFO mapred.JobClient: Counters: 15
> 05/17 10:46:59 INFO mapred.JobClient:   Job Counters
> 05/17 10:46:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=29672
> 05/17 10:46:59 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> 05/17 10:46:59 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> 05/17 10:46:59 INFO mapred.JobClient:     Launched map tasks=1
> 05/17 10:46:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 05/17 10:46:59 INFO mapred.JobClient:   File Output Format Counters
> 05/17 10:46:59 INFO mapred.JobClient:     Bytes Written=90
> 05/17 10:46:59 INFO mapred.JobClient:   FileSystemCounters
> 05/17 10:46:59 INFO mapred.JobClient:     FILE_BYTES_READ=130
> 05/17 10:46:59 INFO mapred.JobClient:     HDFS_BYTES_READ=134
> 05/17 10:46:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=21557
> 05/17 10:46:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=90
> 05/17 10:46:59 INFO mapred.JobClient:   File Input Format Counters
> 05/17 10:46:59 INFO mapred.JobClient:     Bytes Read=0
> 05/17 10:46:59 INFO mapred.JobClient:   Map-Reduce Framework
> 05/17 10:46:59 INFO mapred.JobClient:     Map input records=0
> 05/17 10:46:59 INFO mapred.JobClient:     Spilled Records=0
> 05/17 10:46:59 INFO mapred.JobClient:     Map output records=0
> 05/17 10:46:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=134
> 05/17 10:46:59 INFO canopy.CanopyDriver: Build Clusters Input: output/data Out: output Measure: org.apache.mahout.common.dist
> sure@6eedf759 t1: 80.0 t2: 55.0
> 05/17 10:46:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool
> 05/17 10:46:59 INFO input.FileInputFormat: Total input paths to process : 1
> 05/17 10:47:00 INFO mapred.JobClient: Running job: job_201205170655_0018
> 05/17 10:47:01 INFO mapred.JobClient:  map 0% reduce 0%
> 05/17 10:47:33 INFO mapred.JobClient:  map 100% reduce 0%
> 05/17 10:47:51 INFO mapred.JobClient:  map 100% reduce 100%
> 05/17 10:48:02 INFO mapred.JobClient: Job complete: job_201205170655_0018
> 05/17 10:48:02 INFO mapred.JobClient: Counters: 25
> 05/17 10:48:02 INFO mapred.JobClient:   Job Counters
> 05/17 10:48:02 INFO mapred.JobClient:     Launched reduce tasks=1
> 05/17 10:48:02 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=30327
> 05/17 10:48:02 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> 05/17 10:48:02 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> 05/17 10:48:02 INFO mapred.JobClient:     Launched map tasks=1
> 05/17 10:48:02 INFO mapred.JobClient:     Data-local map tasks=1
> 05/17 10:48:02 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16031
> 05/17 10:48:02 INFO mapred.JobClient:   File Output Format Counters
> 05/17 10:48:02 INFO mapred.JobClient:     Bytes Written=95
> 05/17 10:48:02 INFO mapred.JobClient:   FileSystemCounters
> 05/17 10:48:02 INFO mapred.JobClient:     FILE_BYTES_READ=396
> 05/17 10:48:02 INFO mapred.JobClient:     HDFS_BYTES_READ=217
> 05/17 10:48:02 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45263
> 05/17 10:48:02 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=95
> 05/17 10:48:02 INFO mapred.JobClient:   File Input Format Counters
> 05/17 10:48:02 INFO mapred.JobClient:     Bytes Read=90
> 05/17 10:48:02 INFO mapred.JobClient:   Map-Reduce Framework
> 05/17 10:48:02 INFO mapred.JobClient:     Reduce input groups=0
> 05/17 10:48:02 INFO mapred.JobClient:     Map output materialized bytes=6
> 05/17 10:48:02 INFO mapred.JobClient:     Combine output records=0
> 05/17 10:48:02 INFO mapred.JobClient:     Map input records=0
> 05/17 10:48:02 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 05/17 10:48:02 INFO mapred.JobClient:     Reduce output records=0
> 05/17 10:48:02 INFO mapred.JobClient:     Spilled Records=0
> 05/17 10:48:02 INFO mapred.JobClient:     Map output bytes=0
> 05/17 10:48:02 INFO mapred.JobClient:     Combine input records=0
> 05/17 10:48:02 INFO mapred.JobClient:     Map output records=0
> 05/17 10:48:02 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
> 05/17 10:48:02 INFO mapred.JobClient:     Reduce input records=0
> 05/17 10:48:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool
> 05/17 10:48:03 INFO input.FileInputFormat: Total input paths to process : 1
> 05/17 10:48:03 INFO mapred.JobClient: Running job: job_201205170655_0019
> 05/17 10:48:04 INFO mapred.JobClient:  map 0% reduce 0%
> 05/17 10:48:35 INFO mapred.JobClient: Task Id : attempt_201205170655_0019_m_000000_0, Status : FAILED
> java.lang.IllegalStateException: Canopies are empty!
>      at org.apache.mahout.clustering.canopy.ClusterMapper.setup(ClusterMapper.java:81)
>      at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>      at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:415)
>      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>      at org.apache.hadoop.mapred.Child.main(Child.java:260)
> attempt_201205170655_0019_m_000000_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
> attempt_201205170655_0019_m_000000_0: log4j:WARN Please initialize the log4j system properly.
> 05/17 10:48:53 INFO mapred.JobClient: Task Id : attempt_201205170655_0019_m_000000_1, Status : FAILED
> java.lang.IllegalStateException: Canopies are empty!
>      at org.apache.mahout.clustering.canopy.ClusterMapper.setup(ClusterMapper.java:81)
>      at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>      at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:415)
>      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>      at org.apache.hadoop.mapred.Child.main(Child.java:260)
> attempt_201205170655_0019_m_000000_1: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
> attempt_201205170655_0019_m_000000_1: log4j:WARN Please initialize the log4j system properly.
> Terminate batch job (Y/N)? ^V
> Please let me know if this issue can be resolved. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira