Posted to user@mahout.apache.org by Gokul Pillai <go...@gmail.com> on 2010/07/15 23:19:09 UTC

Help with running clusterdump after running Dirichlet

I have Cloudera's CDH3 running on Ubuntu 10.04, and I have Apache Mahout (a 0.4-SNAPSHOT build from yesterday's trunk).

I was trying to get the clustering examples running based on the wiki page
https://cwiki.apache.org/confluence/display/MAHOUT/Synthetic+Control+Data.
At the bottom of this page, there is a section that describes how to get the
data out and process it.
Get the data out of HDFS [3] [4] and have a look [5]
[3] https://cwiki.apache.org/confluence/display/MAHOUT/Synthetic+Control+Data#Footnote3
[4] https://cwiki.apache.org/confluence/display/MAHOUT/Synthetic+Control+Data#Footnote4
[5] https://cwiki.apache.org/confluence/display/MAHOUT/Synthetic+Control+Data#Footnote5

   - All example jobs use *testdata* as input and write to directory *output*
   - Use *bin/hadoop fs -lsr output* to view all outputs. Copy them all to
   your local machine and you can run the ClusterDumper on them (see the
   sketch after this list).
      - Sequence files containing the original points in Vector form are in
      *output/data*
      - Computed clusters are contained in *output/clusters-i*
      - All resulting clustered points are placed into *output/clusteredPoints*
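For reference, the copy step can be done roughly like this (the commands just
mirror the wiki's instructions; the local target path matches the layout shown
below and is only an example):

    # list the example output in HDFS, then copy it to the local filesystem
    bin/hadoop fs -lsr output
    bin/hadoop fs -copyToLocal output ~/mahoutOutputs/dirichlet/output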


So I got the data out of HDFS onto my local and it looks like this:

hadoop@ubuntu:~/mahoutOutputs$ ls -l dirichlet/output/
total 32
drwxr-xr-x 3 hadoop hadoop 4096 2010-07-13 16:06 clusteredPoints
drwxr-xr-x 2 hadoop hadoop 4096 2010-07-13 16:06 clusters-0
drwxr-xr-x 3 hadoop hadoop 4096 2010-07-13 16:06 clusters-1
drwxr-xr-x 3 hadoop hadoop 4096 2010-07-13 16:06 clusters-2
drwxr-xr-x 3 hadoop hadoop 4096 2010-07-13 16:06 clusters-3
drwxr-xr-x 3 hadoop hadoop 4096 2010-07-13 16:06 clusters-4
drwxr-xr-x 3 hadoop hadoop 4096 2010-07-13 16:06 clusters-5
drwxr-xr-x 3 hadoop hadoop 4096 2010-07-13 16:06 data


However, when I run clusterdump on this, I get the following error. Any help
on why clusterdump is complaining about a "_logs" folder would be appreciated:

hadoop@ubuntu:~/mahoutOutputs$ ../mahoutsvn/trunk/bin/mahout clusterdump --seqFileDir dirichlet/output/clusters-1 --pointsDir dirichlet/output/clusteredPoints/ --output dumpOut
no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
Exception in thread "main" java.io.FileNotFoundException:
/home/hadoop/mahoutOutputs/dirichlet/output/clusteredPoints/_logs (Is a directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:63)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:99)
    at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:169)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:126)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:323)
    at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:93)
    at org.apache.mahout.utils.clustering.ClusterDumper.<init>(ClusterDumper.java:86)
    at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:272)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:175)

Regards
Gokul

Re: Help with running clusterdump after running Dirichlet

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Dunno, I'm not that familiar with the code. I just ran all the unit tests,
and the synthetic control jobs all ran on Hadoop yesterday and produced
output, but they don't use the command line, just the Java API. Once it's
refactored to AbstractJob it will be possible to write a test for the
parameter extraction. I will investigate, but I'm traveling over the
weekend starting today and won't have much time until Monday. It would be
interesting to hear what you discover.

Jeff

On 7/16/10 6:37 AM, Robin Anil wrote:
> Nope. doesnt work. will sit and figure out tonight
>
> Robin


Re: Help with running clusterdump after running Dirichlet

Posted by Robin Anil <ro...@gmail.com>.
Nope, doesn't work. Will sit and figure it out tonight.

Robin

On Fri, Jul 16, 2010 at 7:02 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Try clusterdump -s reuters-clusters/cluster-6 -d... It's expecting a
> directory to find the cluster parts in and is quite passive about doing
> nothing if it does not find one. This could obviously be improved; at least
> an error message would be appropriate. I see it does not extend AbstractJob
> either. I'll look into that next week.
>
> Jeff

Re: Help with running clusterdump after running Dirichlet

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Try clusterdump -s reuters-clusters/cluster-6 -d... It expects a directory
to find the cluster parts in and quietly does nothing if it does not find
one. This could obviously be improved; at the least an error message would
be appropriate. I see it does not extend AbstractJob either. I'll look into
that next week.
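Spelled out against Robin's command (quoted below), that suggestion would be
roughly:

    bin/mahout clusterdump -s reuters-clusters/cluster-6 \
        -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10 -b 100

i.e. pass the clusters directory itself to -s rather than a single
part-r-00000 file inside it.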

Jeff


On 7/16/10 12:24 AM, Robin Anil wrote:
> I am trying to run clusterdumper from trunk. seems like its
> not outputting anything. Need to investigate
> bin/mahout clusterdump -s reuters-clusters/cluster-6/part-r-00000 -d
> reuters-vectors/dictionary.file-0  -dt sequencefile -n 10 -b 100


Re: Help with running clusterdump after running Dirichlet

Posted by Robin Anil <ro...@gmail.com>.
I am trying to run clusterdumper from trunk. Seems like it's not outputting
anything. Need to investigate:
bin/mahout clusterdump -s reuters-clusters/cluster-6/part-r-00000 -d
reuters-vectors/dictionary.file-0  -dt sequencefile -n 10 -b 100

On Fri, Jul 16, 2010 at 7:08 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Also it looks like you are not passing a clusters-n directory to the
> --seqFileDir as you were in your first posting. ClusterDumper won't output
> anything if it cannot read clusters from that directory. Also, all the
> synthetic control jobs now all call ClusterDumper automatically after
> clustering the points.

Re: Help with running clusterdump after running Dirichlet

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Also, it looks like you are not passing a clusters-n directory to
--seqFileDir as you were in your first posting. ClusterDumper won't output
anything if it cannot read clusters from that directory. Also, all the
synthetic control jobs now call ClusterDumper automatically after
clustering the points.
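Against the directory layout from the first posting, that means pointing
--seqFileDir at one of the clusters-n directories, roughly like this
(clusters-5 here just stands in for whichever iteration you want to inspect):

    bin/mahout clusterdump --seqFileDir dirichlet/output/clusters-5 \
        --pointsDir dirichlet/output/clusteredPoints --output dumpOut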

On 7/15/10 5:58 PM, Jeff Eastman wrote:
> Hi Gokul,
>
> Try building and running again. I committed a patch to ClusterDumper 
> which handles the _log file error when running on Hadoop.
>
> Jeff


Re: Help with running clusterdump after running Dirichlet

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Gokul,

Try building and running again. I committed a patch to ClusterDumper 
which handles the _log file error when running on Hadoop.
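Rebuilding from the svn checkout would be roughly the following, assuming
the usual Maven build (skipping tests to save time):

    cd ../mahoutsvn/trunk
    svn up
    mvn -DskipTests clean install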

Jeff

On 7/15/10 2:27 PM, Gokul Pillai wrote:
> My bad. After setting HADOOP_CONF_DIR and HADOOP_HOME, I now don't get the
> errors.
> However, I dont get any output too.
> I tried this command too but again no output:
> ./bin/mahout clusterdump --seqFileDir dirichlet/output/data/ --pointsDir
> dirichlet/output/clusteredPoints/ --output dumpOut
>
> Anybody run the clusterdump successfully?


Re: Help with running clusterdump after running Dirichlet

Posted by Gokul Pillai <go...@gmail.com>.
My bad. After setting HADOOP_CONF_DIR and HADOOP_HOME, I no longer get the
errors.
However, I don't get any output either.
I tried this command too, but again no output:
./bin/mahout clusterdump --seqFileDir dirichlet/output/data/ --pointsDir
dirichlet/output/clusteredPoints/ --output dumpOut

Anybody run the clusterdump successfully?
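(Roughly, the environment the bin/mahout script checks for looks like the
following; the Hadoop paths here are just the usual CDH3 package locations
and are only an example, so adjust them to your install:)

    export HADOOP_HOME=/usr/lib/hadoop        # example CDH3 install location
    export HADOOP_CONF_DIR=/etc/hadoop/conf   # example CDH3 config location
    ./bin/mahout clusterdump --seqFileDir dirichlet/output/clusters-1 \
        --pointsDir dirichlet/output/clusteredPoints/ --output dumpOut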

