Posted to dev@mahout.apache.org by eric skinner <er...@gmail.com> on 2011/08/09 19:07:50 UTC
Is this a bug or a setup issue for using NewsKMeansClustering.java
Hello,
I am working through NewsKMeansClustering.java, the example code given in
chapter 9 of Mahout in Action. I ran the program against a directory of
sequence files, and it fails with the following error:
Exception in thread "main" java.io.FileNotFoundException: File newsClusters/clustersclusteredPoints/part-m-00000 does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:676)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at mia.clustering.ch09.NewsKMeansClustering.main(NewsKMeansClustering.java:76)
For reference, the directory structure generated by running the program
is as follows:
~/workspaceMahout1/recommender/newsClusters% ls
canopy-centroids clusters df-count dictionary.file-0 frequency.file-0
tfidf-vectors tf-vectors tokenized-documents wordcount
~/workspaceMahout1/recommender/newsClusters/clusters/clusteredPoints% ls
part-m-00000
Afterwards, I changed the code from the original
new Path(clusterOutput+Cluster.CLUSTERED_POINTS_DIR +"/part-m-00000"), conf);
to
new Path(clusterOutput+"/clusteredPoints"+"/part-m-00000"), conf);
With this change the program runs without the above error. I would like to
know whether this is a bug in the original code, or whether there is some
other hidden issue.
Re: Is this a bug or a setup issue for using NewsKMeansClustering.java
Posted by Frank Scholten <fr...@frankscholten.nl>.
On Tue, Aug 9, 2011 at 7:42 PM, eric skinner <er...@gmail.com> wrote:
> Frank,
>
> what did you mean "there is a / missing between clustersOutput and
> clusteredPoints in the path."
The clustering job outputs points in the subdirectory
'clusteredPoints' directly under the given output path.
> Afterwards, I change the code from the original one
> new Path(clusterOutput+Cluster.CLUSTERED_POINTS_DIR +"/part-m-00000"), conf);
> to
> new Path(clusterOutput+"/clusteredPoints"+"/part-m-00000"), conf);
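AFAIK the Cluster.CLUSTERED_POINTS_DIR constant does not start with a /,
so concatenating it directly onto the output path gives
'clustersclusteredPoints', as in your error message.
When you added the / in front of 'clusteredPoints', it worked.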
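To make the difference concrete, here is a minimal sketch of the path strings discussed in this thread. It uses only the JDK (java.nio.file) rather than Hadoop, so it can be run on its own. The values "newsClusters/clusters" for clusterOutput and "clusteredPoints" for the constant are assumptions inferred from the error messages above, and the class and method names are made up for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ClusteredPointsPath {
    // Assumed values, inferred from the error messages in this thread.
    static final String CLUSTER_OUTPUT = "newsClusters/clusters";
    static final String CLUSTERED_POINTS_DIR = "clusteredPoints"; // no leading slash

    // Plain string concatenation, as in the original example code:
    // no separator is inserted between parent and child.
    static String concatenated(String parent, String child) {
        return parent + child + "/part-m-00000";
    }

    // Parent/child join: resolve() inserts the separator itself.
    static String joined(String parent, String child) {
        Path p = Paths.get(parent).resolve(child).resolve("part-m-00000");
        // Use forward slashes so the result looks the same on every OS.
        return p.toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        // Original code: newsClusters/clustersclusteredPoints/part-m-00000
        System.out.println(concatenated(CLUSTER_OUTPUT, CLUSTERED_POINTS_DIR));
        // With the added slash: newsClusters/clusters/clusteredPoints/part-m-00000
        System.out.println(concatenated(CLUSTER_OUTPUT, "/" + CLUSTERED_POINTS_DIR));
        // With an extra "/clusters": newsClusters/clusters/clusters/clusteredPoints/part-m-00000
        System.out.println(concatenated(CLUSTER_OUTPUT, "/clusters/clusteredPoints"));
        // Joined instead of concatenated: newsClusters/clusters/clusteredPoints/part-m-00000
        System.out.println(joined(CLUSTER_OUTPUT, CLUSTERED_POINTS_DIR));
    }
}
```

In the Hadoop code itself, the two-argument constructor new Path(parent, child) should play the same role as resolve() here, so the slash does not have to be added by hand.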
You can also use a SequenceFileDirIterable to iterate through the points.
Frank
Re: Is this a bug or a setup issue for using NewsKMeansClustering.java
Posted by eric skinner <er...@gmail.com>.
Frank,
What did you mean by "there is a / missing between clustersOutput and
clusteredPoints in the path"?
I just tried two more ways of building the path:
new Path(clusterOutput+"/clusters"+"/clusteredPoints"+"/part-m-00000"),conf);
new Path(clusterOutput+"/clusters/clusteredPoints"+"/part-m-00000"),conf);
Both of them cause the following error:
File newsClusters/clusters/clusters/clusteredPoints/part-m-00000 does not exist.
It seems to me that "clusteredPoints" inherently equals
"/clusters/clusteredPoints". The original code given in "Mahout in Action"
uses Cluster.CLUSTERED_POINTS_DIR. However, using it causes an error as
well, as I included in my previous post:
File newsClusters/clustersclusteredPoints/part-m-00000 does not exist.
This is really confusing.
Thanks.
On Tue, Aug 9, 2011 at 1:22 PM, Frank Scholten <fr...@frankscholten.nl> wrote:
> It seems there is a / missing between clustersOutput and
> clusteredPoints in the path.
Re: Is this a bug or a setup issue for using NewsKMeansClustering.java
Posted by Frank Scholten <fr...@frankscholten.nl>.
It seems there is a / missing between clustersOutput and
clusteredPoints in the path.
Cheers,
Frank
Sent from a Hungarian keyboard at Sziget festival