Posted to dev@mahout.apache.org by "Dan Brickley (Commented) (JIRA)" <ji...@apache.org> on 2012/02/19 15:12:34 UTC

[jira] [Commented] (MAHOUT-978) spectralkmeans utility fails when input filename begins with leading underscore

    [ https://issues.apache.org/jira/browse/MAHOUT-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211361#comment-13211361 ] 

Dan Brickley commented on MAHOUT-978:
-------------------------------------

According to https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/404229d8b0ef044b/eb7f5d17823b63f1 files named with a leading '_' (or '.') are considered hidden files, at least in some aspects of Hadoop/HDFS. More discussion here: http://lucene.472066.n3.nabble.com/do-HDFS-files-starting-with-underscore-have-special-properties-td3305238.html

In this light, I'd recommend treating this as a documentation issue. I'm not sure which other parts of Mahout use Hadoop APIs that show the same behaviour. I simply hadn't heard this about '_' in Hadoop, and let my own habit of naming generated files that way leak into my HDFS file naming.
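
To make the naming convention concrete, here is a minimal sketch of the check as I understand it: Hadoop's file-based input formats skip any path whose final component begins with '_' or '.'. If the only file in the input directory is hidden, the job would see no input at all, which may explain the failure reported here. The function names (is_hidden, visible_inputs) are mine, for illustration only; this is not Hadoop or Mahout code.

```python
import posixpath

# Prefixes treated as "hidden", per the convention described in the
# threads linked above (an assumption about the default behaviour).
HIDDEN_PREFIXES = ("_", ".")


def is_hidden(path):
    """Return True if the last path component starts with '_' or '.'."""
    name = posixpath.basename(path.rstrip("/"))
    return name.startswith(HIDDEN_PREFIXES)


def visible_inputs(paths):
    """Keep only the paths that such a filter would let through."""
    return [p for p in paths if not is_hidden(p)]


if __name__ == "__main__":
    candidates = [
        "spectral/input/_topic_skm.csv",
        "spectral/input/topic_skm.csv",
    ]
    for p in candidates:
        print(p, "hidden" if is_hidden(p) else "visible")
```

Running a check like this over the input listing before submitting a job would flag the problem up front rather than after the first stage has completed.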


                
> spectralkmeans utility fails when input filename begins with leading underscore
> -------------------------------------------------------------------------------
>
>                 Key: MAHOUT-978
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-978
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: Tested on a real Linux-based cluster running Hadoop 0.20.2-cdh3u2 and the 0.6 release; also OSX pseudo cluster running Hadoop 0.20.203.0 running 16 Feb trunk build.
>            Reporter: Dan Brickley
>            Priority: Minor
>         Attachments: jira-underscore-spectral-log.txt
>
>
> The commandline 'bin/mahout spectralkmeans' utility fails with NoSuchElementException after "Loading vector from: spectral/output/results2/calculations/diagonal/part-r-00000"  when input data in hdfs has filename beginning with a leading underscore.
> This was partially reported in comments for MAHOUT-524, but I believe it has now been identified as a distinct issue (thanks to Shannon for help diagnosing). I have not investigated whether there is an equivalent problem for API-based use of this piece of Mahout.
> Steps to reproduce: 
> 1. put affinity file into hdfs, following https://cwiki.apache.org/MAHOUT/spectral-clustering.html - note that node IDs count from zero etc. Name your file with a leading underscore. For example, try http://danbri.org/2012/spectral/dbpedia/_topic_skm.csv and store it in spectral/input/_topic_skm.csv
> (I'll leave that example input file in place unchanged for others to try. It is built from dbpedia data, encoding associations from Wikipedia pages to categories. Whether it is a good use of spectral clustering I'm not sure, but I'd at least hope the job would run to completion.)
> 2. Run 'mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o spectral/output/results1'
> 3. Wait for it to fail just after printing "Loading vector from: spectral/output/results1/calculations/diagonal/part-r-00000", with java.util.NoSuchElementException at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152).
> 4. Rename the file in hdfs to eliminate the leading underscore. Re-run the command (give a different results dir, or clean up after the first run, to avoid mixing the tests). This attempt should succeed and you'll see it proceed deeper into the job, i.e. something like 
> 12/02/19 14:38:32 INFO common.VectorCache: Loading vector from: spectral/output/results2/calculations/diagonal/part-r-00000
> 12/02/19 14:38:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 12/02/19 14:38:43 INFO input.FileInputFormat: Total input paths to process : 1
> 12/02/19 14:38:44 INFO mapred.JobClient: Running job: job_201202191410_0005
> 12/02/19 14:38:45 INFO mapred.JobClient:  map 0% reduce 0%
> 12/02/19 14:39:31 INFO mapred.JobClient:  map 1% reduce 0%
> (5. You might get a memory-based failure some time later; that is a separate problem.)
> I'll attach a more detailed transcript. I've made no attempt to diagnose internals yet, but did make some other tests and can confirm that it does not seem to matter whether the commandline invocation names the file explicitly or by directory name only. A trailing slash does not seem to be an issue either. Finally, a related 'gotcha': make sure the results directory is not inside the input directory when testing.
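
As a workaround for step 4 of the quoted report, the renames can be planned before touching HDFS. The sketch below only builds 'hadoop fs -mv' command strings for paths whose file name starts with '_' or '.'; it does not run them and does not check whether the target name already exists. The helper name rename_plan is hypothetical.

```python
import posixpath
import shlex


def rename_plan(paths):
    """Build 'hadoop fs -mv' commands that strip leading '_' or '.'.

    Paths whose file name is already visible, or that would become
    empty after stripping, are left out of the plan.
    """
    plan = []
    for path in paths:
        parent, name = posixpath.split(path)
        cleaned = name.lstrip("_.")
        if cleaned and cleaned != name:
            target = posixpath.join(parent, cleaned)
            plan.append(
                "hadoop fs -mv {} {}".format(shlex.quote(path), shlex.quote(target))
            )
    return plan


if __name__ == "__main__":
    for command in rename_plan(["spectral/input/_topic_skm.csv"]):
        print(command)
```

Printing the plan first makes it easy to review the new names before running the commands by hand.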
