
Posted to user@mahout.apache.org by ipshita chatterji <si...@gmail.com> on 2011/12/14 16:54:11 UTC

Query on clusterdumper output and clusteredPoints

Hi,

I am new to Mahout and have only an elementary knowledge of
clustering. I managed to cluster my data using meanshift and then ran
clusterdumper, which gave the following output:

MSV-21{n=1 c=[1:0...........]

So I assume that the cluster above has converged and that n=1 indicates
that only one point is associated with it.
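
For readers who want to compare these numbers programmatically, here is a minimal sketch of pulling the cluster identifier and the n value out of a dump line. It assumes lines follow the `ID{n=COUNT c=[...` shape shown above; the class and method names are hypothetical and are not part of Mahout.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: parses the header of a dump line
// shaped like "MSV-21{n=1 c=[...". The line shape is an assumption taken
// from the output quoted in this thread.
class DumpLineParser {
    // Group 1: identifier before '{'; group 2: digits following "n=".
    private static final Pattern HEADER =
            Pattern.compile("^\\s*([^\\s{]+)\\{n=(\\d+)\\b");

    private static Matcher match(String line) {
        Matcher m = HEADER.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Unrecognized dump line: " + line);
        }
        return m;
    }

    // Returns the identifier, e.g. "MSV-21".
    static String clusterId(String line) {
        return match(line).group(1);
    }

    // Returns the n value printed in the line.
    static long reportedN(String line) {
        return Long.parseLong(match(line).group(2));
    }
}
```

Whether n really equals the number of member points is exactly what this thread is trying to establish, so the sketch only extracts the value and does not interpret it.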

Next I tried to read the members of this cluster from the "clusteredPoints"
directory. That output shows that 173 points belong to this
cluster.

Why does this mismatch occur? Am I missing something here?

Thanks,
Ipshita

Re: Query on clusterdumper output and clusteredPoints

Posted by ipshita chatterji <si...@gmail.com>.

Yes, I am using 0.6 for all of the steps mentioned below.

On Fri, Dec 16, 2011 at 10:47 AM, Paritosh Ranjan <pr...@xebia.com> wrote:
> Are you using 0.6-snapshot for everything, i.e. clustering, post
> processing, and clusterdumper?
>
> And are you keeping the sequential parameter the same for
> clustering/postprocessing?
>

Re: Query on clusterdumper output and clusteredPoints

Posted by Paritosh Ranjan <pr...@xebia.com>.

Are you using 0.6-snapshot for everything, i.e. clustering, post
processing, and clusterdumper?

And are you keeping the sequential parameter the same for
clustering/postprocessing?

On 15-12-2011 20:01, ipshita chatterji wrote:
> Hi,
> I wrote my own code to read member variables from one of the
> directories generated by the postprocessor. I still get a mismatch
> between the number of clusters generated by clusterdumper and after
> reading the members.
>
> Please see my code snippet below. What am I missing here?
> Clusterdumper displays:
>
> MSV-115{n=3 c=[0:-0.030, 1:0.003, 2:-0.032, 3:-0.053, 4:0.001,.................]
>
> which means there are 3 members belonging to this centroid where as
> the code below generates 412 points.
> <code>
>      Configuration conf = new Configuration();
>      //FileSystem fs = FileSystem.get(pointsDir.toUri(), conf);
>      FileSystem fs = pointsDir.getFileSystem(conf);
>      Path mypath = new Path("output1512/pp/115");
>      //System.out.println(" fs "+fs.getName());
>      try{
>            process(mypath,fs,conf);
>         }
>      catch(Exception e)
>      {
>            System.out.println("Exception :: "+e.getMessage());
>            e.printStackTrace();
>      }
>
>
> public void process(Path clusteredPoints, FileSystem
> fileSystem,Configuration conf)throws Exception {
>       FileStatus[] partFiles =
> getAllClusteredPointPartFiles(clusteredPoints,fileSystem);
>       for (FileStatus partFile : partFiles) {
>    	  SequenceFile.Reader clusteredPointsReader = new
> SequenceFile.Reader(fileSystem, partFile.getPath(), conf);
>           WritableComparable clusterIdAsKey = (WritableComparable)
> clusteredPointsReader.getKeyClass().newInstance();
>           Writable vector = (Writable)
> clusteredPointsReader.getValueClass().newInstance();
>           while (clusteredPointsReader.next(clusterIdAsKey, vector)) {
>               //use clusterId and vector here to write to a local file.
>               //IntWritable clusterIdAsKey1 = new IntWritable();
>               Text clusterIdAsKey1 = new Text();
>               //WeightedVectorWritable point1 = new WeightedVectorWritable();
>               VectorWritable point1 = new VectorWritable();
>
>               findClusterAndAddVector(clusteredPointsReader,
> clusterIdAsKey1, point1);
>           }
>           clusteredPointsReader.close();
>       }
>     }
>
>    private void findClusterAndAddVector(SequenceFile.Reader
> clusteredPointsReader,
>                                         //IntWritable clusterIdAsKey1,
>                                         Text clusterIdAsKey1,
>                                         VectorWritable point1) throws
> IOException {
>      while (clusteredPointsReader.next(clusterIdAsKey1, point1)) {
>        //String clusterId = clusterIdAsKey1.toString().trim();
>        //String point = point1.toString();
>        //System.out.println("Adding point to cluster " + clusterId);
>        org.apache.mahout.math.Vector vec= point1.get();
>        System.out.println(vec.asFormatString());
>      }
>    }
>
>
> private FileStatus[] getAllClusteredPointPartFiles(Path
> clusteredPoints, FileSystem fileSystem) throws IOException {
>       System.out.println(" clusteredPoints :: "+clusteredPoints.getName());
>
>       System.out.println(" fileSystem:: "+fileSystem.getName());
>
>       //Path[] partFilePaths =
> FileUtil.stat2Paths(fileSystem.globStatus(clusteredPoints,
> PathFilters.partFilter()));
>       Path[] partFilePaths =
> FileUtil.stat2Paths(fileSystem.globStatus(clusteredPoints));
>
>       int size=partFilePaths.length;
>       System.out.println("Lenght :: "+size);
>       FileStatus[] partFileStatuses =
> fileSystem.listStatus(partFilePaths, PathFilters.partFilter());
>       return partFileStatuses;
>     }
>
>
>
> On Thu, Dec 15, 2011 at 2:19 PM, Paritosh Ranjan<pr...@xebia.com>  wrote:
>> If you want to put data in to local file system, I think you will have to
>> read the data in the cluster directories (output of postprocessor), one by
>> one and write it on your local system.
>>
>> I am not sure what ClusterDumper does, if it also does the same thing(reads
>> clusters output and writes output on local file system), then you can use it
>> on all the directories produced by postprocessor.
>>
>>
>> On 15-12-2011 14:07, ipshita chatterji wrote:
>>> I have used ClusterOutputPostProcessorDriver. Now how do I read the
>>> output generated by postprocessor? Is there a tool for that too?
>>>
>>> On Thu, Dec 15, 2011 at 10:37 AM, Paritosh Ranjan<pr...@xebia.com>
>>>   wrote:
>>>> Some typo in previous mail. Please read :
>>>>
>>>> ...which will post process your clustering output and group vectors
>>>> belonging to different clusters in their respective directories...
>>>>
>>>>
>>>> On 15-12-2011 10:34, Paritosh Ranjan wrote:
>>>>> You don't need to write your own code for analyzing clustered points.
>>>>> You
>>>>> can use ClusterOutputPostProcessorDriver which will post process your
>>>>> clusters and group clusters belonging to different clusters in their
>>>>> respective directories. You won't get any OOM here.
>>>>>
>>>>> Example of using it is here
>>>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html.
>>>>>
>>>>> And I would advise using the current 0.6-snapshot for
>>>>> clustering as well as for post processing.
>>>>> Using 0.5 for clustering and 0.6-snapshot for the post processing
>>>>> code
>>>>> might create problems.
>>>>>
>>>>> Paritosh
>>>>>
>>>>> On 15-12-2011 08:37, ipshita chatterji wrote:
>>>>>> Actually clustering was done using the 0.5 version of Mahout, but I am
>>>>>> using the clusterdumper code from the current version of Mahout present
>>>>>> in "trunk" to analyze the clusters. To make it run I renamed the final
>>>>>> cluster by appending "-final".
>>>>>> I got the OOM error even after increasing the Mahout heap size, and
>>>>>> hence wrote code of my own to analyze the clusters by reading
>>>>>> "-clusteredPoints".
>>>>>>
>>>>>> Thu, Dec 15, 2011 at 2:58 AM, Gary Snider<ga...@gmail.com>
>>>>>>   wrote:
>>>>>>
>>>>>>> Ok.  See if you can get the --pointsDir working and post what you get.
>>>>>>>   Also for seqFileDir do you have a directory with the word 'final' in
>>>>>>> it?
>>>>>>>
>>>>>>> On Dec 14, 2011, at 12:37 PM, ipshita chatterji<si...@gmail.com>
>>>>>>>   wrote:
>>>>>>>
>>>>>>>> For clusterdumper I had following commandline:
>>>>>>>>
>>>>>>>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>>>>>>>> --output clusteranalyze.txt
>>>>>>>>
>>>>>>>> Have written a separate program to read clusteredOutput directory as
>>>>>>>> clusterdumper with "--pointsDir output/clusteredPoints " was giving
>>>>>>>> OOM exception.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Wed, Dec 14, 2011 at 10:06 PM, Gary
>>>>>>>> Snider<ga...@gmail.com>
>>>>>>>>   wrote:
>>>>>>>>> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>>>>>>>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita
>>>>>>>>> chatterji<si...@gmail.com>wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am a newbie in Mahout and also have elementary knowledge of
>>>>>>>>>> clustering. I managed to cluster my data using meanshift and then
>>>>>>>>>> ran
>>>>>>>>>> clusterdumper, I get following output:
>>>>>>>>>>
>>>>>>>>>> MSV-21{n=1 c=[1:0...........]
>>>>>>>>>>
>>>>>>>>>> So I asssume that the cluster above has converged and n=1 indicates
>>>>>>>>>> that there is only one point associated with the cluster above.
>>>>>>>>>>
>>>>>>>>>> Now I try to read the members of this cluster from
>>>>>>>>>> "clusteredPoints"
>>>>>>>>>> directory. I see from the output that number of points belonging
>>>>>>>>>> this
>>>>>>>>>> cluster is 173.
>>>>>>>>>>
>>>>>>>>>> Why is this mismatch happening? Am I missing something here?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ipshita
>>>>>>>>>>

Re: Query on clusterdumper output and clusteredPoints

Posted by Paritosh Ranjan <pr...@xebia.com>.

To summarize the problem for now: clusterdumper is not
giving the expected results.

Is clusterdumper able to process mapreduce clustered data? I would
suggest also trying to run everything sequentially; if clusterdumper
is only suitable for sequential output, then that might be the problem.

I usually read the output myself with my own code. Try that too,
it's easy. If you still get different results, then either you are not
using ClusterDumper correctly, or there is a bug in it.
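
When writing such a reader, it is worth checking how many records the loop actually handles. The sketch below uses a plain `java.util.Iterator` as a stand-in for a sequence-file reader (an assumption for illustration only; no Hadoop or Mahout classes are involved, and the class name is hypothetical). It contrasts a single read loop with the nested pattern used in the snippet posted in this thread, where an outer read consumes one record before an inner loop drains the rest of the same cursor.

```java
import java.util.Iterator;

// Hypothetical illustration: an Iterator stands in for a forward-only reader.
class CursorCounting {
    // One loop, one read per iteration: every record is handled once.
    static int countSingleLoop(Iterator<String> records) {
        int handled = 0;
        while (records.hasNext()) {
            records.next();
            handled++;
        }
        return handled;
    }

    // Nested pattern: the outer loop reads a record and discards it, then
    // the inner loop handles whatever is left on the shared cursor.
    static int countNestedLoop(Iterator<String> records) {
        int handled = 0;
        while (records.hasNext()) {
            records.next();              // consumed here, never handled
            while (records.hasNext()) {  // drains the rest of the cursor
                records.next();
                handled++;
            }
        }
        return handled;
    }
}
```

With three records, the single loop handles three and the nested pattern handles two. This only shifts a count by one per file, so it would not by itself account for a gap as large as the one reported here.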


On 16-12-2011 15:56, ipshita chatterji wrote:
> Hi Paritosh,
> As mentioned earlier the mismatch is in the number of member variables
> belonging to a cluster. Please see my email below:
>
> "I managed to cluster my data using meanshift and then ran
> clusterdumper, I get following output:
>
> MSV-21{n=1 c=[1:0...........]
>
> So I asssume that the cluster above has converged and n=1 indicates
> that there is only one point associated with the cluster above.
>
> Now I try to read the members of this cluster from "clusteredPoints"
> directory. I see from the output that number of points belonging this
> cluster is 173."
>
> This mismatch persists even after using 0.6 snapshot.
>
> Thanks,
> Ipshita

Re: Query on clusterdumper output and clusteredPoints

Posted by ipshita chatterji <si...@gmail.com>.

Hi Paritosh,
As mentioned earlier, the mismatch is in the number of members (points)
belonging to a cluster. Please see my earlier email below:

"I managed to cluster my data using meanshift and then ran
clusterdumper, I get following output:

MSV-21{n=1 c=[1:0...........]

So I asssume that the cluster above has converged and n=1 indicates
that there is only one point associated with the cluster above.

Now I try to read the members of this cluster from "clusteredPoints"
directory. I see from the output that number of points belonging this
cluster is 173."

This mismatch persists even after using 0.6 snapshot.

Thanks,
Ipshita

On Fri, Dec 16, 2011 at 3:18 PM, Paritosh Ranjan <pr...@xebia.com> wrote:
> /I have used this from 0.6 snapshot and the number of clusters matches the
> number of clusters generated by clusterdumper./
>
> This was the previous mismatch. What exactly is the mismatch now?
>
> Have you analyzed the vectors inside each cluster? Are they being clustered
> properly. If not, you might need to tune your clustering algorithm and its
> parameters. If yes, then its being clustered properly.

Re: Query on clusterdumper output and clusteredPoints

Posted by Paritosh Ranjan <pr...@xebia.com>.

/I have used this from 0.6 snapshot and the number of clusters matches the number of clusters generated by clusterdumper./

This was the previous mismatch. What exactly is the mismatch now?

Have you analyzed the vectors inside each cluster? Are they being
clustered properly? If not, you might need to tune your clustering
algorithm and its parameters. If yes, then the data is being clustered properly.


On 16-12-2011 14:58, ipshita chatterji wrote:
> Hi,
> Thanks for the pointers. Please see my replies inline>>
>
> You can use ClusterCountReader to find out the number of clusters in the output.
>>> I have used this from 0.6 snapshot and the number of clusters matches the number of clusters generated by clusterdumper.
> I think doing following things will fulfill your requirement:
>
> 1) Use 0.6-snapshot all along.
>>> Used but the mismatch persists
> 2) Do clustering ( Note how you did it : sequentially or mapreduce way )
>>> mapreduce way
> 3) Run ClusterOutputPostProcessorDriver ( the same way as in step 2 :
> sequentially or mapreduce way ) and after that, read vectors of the
> clusters
>>> same as (2) above
> 4) Analyze whether the vectors have been clustered properly according
> to your requirement.
>
> Have I missed anything now?
>
> Thanks,
> Ipshita

Re: Query on clusterdumper output and clusteredPoints

Posted by ipshita chatterji <si...@gmail.com>.

Hi,
Thanks for the pointers. Please see my replies inline>>

You can use ClusterCountReader to find out the number of clusters in the output.
>> I have used this from 0.6 snapshot and the number of clusters matches the number of clusters generated by clusterdumper.

I think doing following things will fulfill your requirement:

1) Use 0.6-snapshot all along.
>> Used but the mismatch persists
2) Do clustering ( Note how you did it : sequentially or mapreduce way )
>> mapreduce way
3) Run ClusterOutputPostProcessorDriver ( the same way as in step 2 :
sequentially or mapreduce way ) and after that, read vectors of the
clusters
>> same as (2) above

4) Analyze whether the vectors have been clustered properly according
to your requirement.

Have I missed anything now?

Thanks,
Ipshita
On Fri, Dec 16, 2011 at 11:00 AM, Paritosh Ranjan <pr...@xebia.com> wrote:
> /"I still get a mismatch between the number of clusters generated by
> clusterdumper and after reading the members. "/
>
> You can use ClusterCountReader to find out the number of clusters in the
> output.
>
> I think doing following things will fulfill your requirement:
>
> 1) Use 0.6-snapshot all along.
> 2) Do clustering ( Note how you did it : sequentially or mapreduce way )
> 3) Run ClusterOutputPostProcessorDriver ( the same way as in step 2 :
> sequentially or mapreduce way ) and after that, read vectors of the clusters
> 4) Analyze whether the vectors have been clustered properly according to
> your requirement.

Re: Query on clusterdumper output and clusteredPoints

Posted by Paritosh Ranjan <pr...@xebia.com>.

/"I still get a mismatch between the number of clusters generated by 
clusterdumper and after reading the members. "/

You can use ClusterCountReader to find out the number of clusters in the 
output.

I think doing the following will fulfill your requirement:

1) Use 0.6-snapshot all along.
2) Do clustering ( Note how you did it : sequentially or mapreduce way )
3) Run ClusterOutputPostProcessorDriver ( the same way as in step 2 : 
sequentially or mapreduce way ) and after that, read vectors of the clusters
4) Analyze whether the vectors have been clustered properly according to 
your requirement.
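
As a rough aid for step 4, the comparison between what the dump reports and what is actually read back can be sketched as a simple tally. This is a minimal sketch under stated assumptions: the cluster ids are taken to be available as one string per point record, the reported counts as a map from id to n, and the class and method names are hypothetical. The values used in any call are toy data, not results from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout.
class ClusterCountCheck {
    // Counts how many point records carry each cluster id.
    static Map<String, Integer> tally(List<String> clusterIdPerPoint) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String id : clusterIdPerPoint) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    // Returns, in sorted order, the ids whose tallied count differs from
    // the reported n. An id missing from the tally counts as zero points.
    static List<String> mismatches(Map<String, Integer> reported,
                                   Map<String, Integer> tallied) {
        List<String> differing = new ArrayList<>();
        for (Map.Entry<String, Integer> e : new TreeMap<>(reported).entrySet()) {
            int actual = tallied.getOrDefault(e.getKey(), 0);
            if (actual != e.getValue()) {
                differing.add(e.getKey());
            }
        }
        return differing;
    }
}
```

Listing every differing id at once, rather than inspecting one cluster by hand, makes it easier to see whether a mismatch is isolated or systematic.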

On 15-12-2011 20:01, ipshita chatterji wrote:
> I still get a mismatch
> between the number of clusters generated by clusterdumper and after
> reading the members.

Re: Query on clusterdumper output and clusteredPoints

Posted by ipshita chatterji <si...@gmail.com>.

Hi,
I wrote my own code to read member variables from one of the
directories generated by the postprocessor. I still get a mismatch
between the number of clusters generated by clusterdumper and after
reading the members.

Please see my code snippet below. What am I missing here?
Clusterdumper displays:

MSV-115{n=3 c=[0:-0.030, 1:0.003, 2:-0.032, 3:-0.053, 4:0.001,.................]

which means there are 3 members belonging to this centroid, whereas
the code below generates 412 points.
<code>
    Configuration conf = new Configuration();
    //FileSystem fs = FileSystem.get(pointsDir.toUri(), conf);
    FileSystem fs = pointsDir.getFileSystem(conf);
    Path mypath = new Path("output1512/pp/115");
    //System.out.println(" fs "+fs.getName());
    try {
        process(mypath, fs, conf);
    } catch (Exception e) {
        System.out.println("Exception :: " + e.getMessage());
        e.printStackTrace();
    }


public void process(Path clusteredPoints, FileSystem fileSystem, Configuration conf) throws Exception {
    FileStatus[] partFiles = getAllClusteredPointPartFiles(clusteredPoints, fileSystem);
    for (FileStatus partFile : partFiles) {
        SequenceFile.Reader clusteredPointsReader = new SequenceFile.Reader(fileSystem, partFile.getPath(), conf);
        WritableComparable clusterIdAsKey = (WritableComparable) clusteredPointsReader.getKeyClass().newInstance();
        Writable vector = (Writable) clusteredPointsReader.getValueClass().newInstance();
        while (clusteredPointsReader.next(clusterIdAsKey, vector)) {
            //use clusterId and vector here to write to a local file.
            //IntWritable clusterIdAsKey1 = new IntWritable();
            Text clusterIdAsKey1 = new Text();
            //WeightedVectorWritable point1 = new WeightedVectorWritable();
            VectorWritable point1 = new VectorWritable();

            findClusterAndAddVector(clusteredPointsReader, clusterIdAsKey1, point1);
        }
        clusteredPointsReader.close();
    }
}

private void findClusterAndAddVector(SequenceFile.Reader clusteredPointsReader,
                                     //IntWritable clusterIdAsKey1,
                                     Text clusterIdAsKey1,
                                     VectorWritable point1) throws IOException {
    while (clusteredPointsReader.next(clusterIdAsKey1, point1)) {
        //String clusterId = clusterIdAsKey1.toString().trim();
        //String point = point1.toString();
        //System.out.println("Adding point to cluster " + clusterId);
        org.apache.mahout.math.Vector vec = point1.get();
        System.out.println(vec.asFormatString());
    }
}


private FileStatus[] getAllClusteredPointPartFiles(Path clusteredPoints, FileSystem fileSystem) throws IOException {
    System.out.println(" clusteredPoints :: " + clusteredPoints.getName());
    System.out.println(" fileSystem:: " + fileSystem.getName());

     //Path[] partFilePaths =
FileUtil.stat2Paths(fileSystem.globStatus(clusteredPoints,
PathFilters.partFilter()));
     Path[] partFilePaths =
FileUtil.stat2Paths(fileSystem.globStatus(clusteredPoints));

     int size=partFilePaths.length;
     System.out.println("Lenght :: "+size);
     FileStatus[] partFileStatuses =
fileSystem.listStatus(partFilePaths, PathFilters.partFilter());
     return partFileStatuses;
   }
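
One thing worth checking in the snippet above: process() calls next() in its own while loop and then hands the same reader to findClusterAndAddVector(), which loops over next() again. A SequenceFile.Reader is a forward-only cursor, so the first record of each part file is consumed by the outer loop and never printed. The sketch below reproduces that pattern with a plain java.util.Iterator standing in for the reader (an assumption, made so that it runs without Hadoop), next to a single-loop version. The class and method names are hypothetical. Note that this only accounts for one missing record per part file; it does not by itself explain the gap between n=3 and 412 points.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo: an Iterator stands in for a forward-only SequenceFile.Reader.
public class CursorDemo {

    // Mirrors the nested pattern: the outer loop reads one record,
    // then the inner loop drains everything that is left.
    public static List<String> nestedRead(Iterator<String> reader) {
        List<String> seen = new ArrayList<>();
        while (reader.hasNext()) {
            reader.next(); // consumed by the outer loop and never recorded
            while (reader.hasNext()) {
                seen.add(reader.next());
            }
        }
        return seen;
    }

    // A single loop sees every record exactly once.
    public static List<String> singleLoopRead(Iterator<String> reader) {
        List<String> seen = new ArrayList<>();
        while (reader.hasNext()) {
            seen.add(reader.next());
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("v1", "v2", "v3");
        System.out.println("nested loops saw: " + nestedRead(records.iterator()));
        System.out.println("single loop saw:  " + singleLoopRead(records.iterator()));
    }
}
```

With the real reader, the equivalent change would be to handle the key and value inside the first while loop instead of passing the reader on to a second one.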



On Thu, Dec 15, 2011 at 2:19 PM, Paritosh Ranjan <pr...@xebia.com> wrote:
> If you want to put data in to local file system, I think you will have to
> read the data in the cluster directories (output of postprocessor), one by
> one and write it on your local system.
>
> I am not sure what ClusterDumper does, if it also does the same thing(reads
> clusters output and writes output on local file system), then you can use it
> on all the directories produced by postprocessor.
>
>
> On 15-12-2011 14:07, ipshita chatterji wrote:
>>
>> I have used ClusterOutputPostProcessorDriver. Now how do I read the
>> output generated by postprocessor? Is there a tool for that too?
>>
>> On Thu, Dec 15, 2011 at 10:37 AM, Paritosh Ranjan<pr...@xebia.com>
>>  wrote:
>>>
>>> Some typo in previous mail. Please read :
>>>
>>> ...which will post process your clustering output and group vectors
>>> belonging to different clusters in their respective directories...
>>>
>>>
>>> On 15-12-2011 10:34, Paritosh Ranjan wrote:
>>>>
>>>> You don't need to write your own code for analyzing clustered points.
>>>> You
>>>> can use ClusterOutputPostProcessorDriver which will post process your
>>>> clusters and group clusters belonging to different clusters in their
>>>> respective directories. You won't get any OOM here.
>>>>
>>>> Example of using it is here
>>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html.
>>>>
>>>> And I would advice to use the current 0.6-snapshot snapshot to do
>>>> clustering as well as post processing it.
>>>> Using 0.5 to use clustering and 0.6-snapshot to write code to post
>>>> process
>>>> might create problems.
>>>>
>>>> Paritosh
>>>>
>>>> On 15-12-2011 08:37, ipshita chatterji wrote:
>>>>>
>>>>> Actually clustering was done using 0.5 version of mahout but I am
>>>>> using the clusterterdumper code from current version of mahout present
>>>>> in "trunk" to analyze the clusters. To make it run I renamed the final
>>>>> cluster by appending "-final".
>>>>> I got the OOM error even after increasing the mahout heapsize and
>>>>> hence had written a code of my own to analyze the clusters by reading
>>>>> "-clusteredPoints".
>>>>>
>>>>> Thu, Dec 15, 2011 at 2:58 AM, Gary Snider<ga...@gmail.com>
>>>>>  wrote:
>>>>>
>>>>>> Ok.  See if you can get the --pointsDir working and post what you get.
>>>>>>  Also for seqFileDir do you have a directory with the word 'final' in
>>>>>> it?
>>>>>>
>>>>>> On Dec 14, 2011, at 12:37 PM, ipshita chatterji<si...@gmail.com>
>>>>>>  wrote:
>>>>>>
>>>>>>> For clusterdumper I had following commandline:
>>>>>>>
>>>>>>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>>>>>>> --output clusteranalyze.txt
>>>>>>>
>>>>>>> Have written a separate program to read clusteredOutput directory as
>>>>>>> clusterdumper with "--pointsDir output/clusteredPoints " was giving
>>>>>>> OOM exception.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Wed, Dec 14, 2011 at 10:06 PM, Gary
>>>>>>> Snider<ga...@gmail.com>
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>>>>>>>>
>>>>>>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita
>>>>>>>> chatterji<si...@gmail.com>wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am a newbie in Mahout and also have elementary knowledge of
>>>>>>>>> clustering. I managed to cluster my data using meanshift and then
>>>>>>>>> ran
>>>>>>>>> clusterdumper, I get following output:
>>>>>>>>>
>>>>>>>>> MSV-21{n=1 c=[1:0...........]
>>>>>>>>>
>>>>>>>>> So I asssume that the cluster above has converged and n=1 indicates
>>>>>>>>> that there is only one point associated with the cluster above.
>>>>>>>>>
>>>>>>>>> Now I try to read the members of this cluster from
>>>>>>>>> "clusteredPoints"
>>>>>>>>> directory. I see from the output that number of points belonging
>>>>>>>>> this
>>>>>>>>> cluster is 173.
>>>>>>>>>
>>>>>>>>> Why is this mismatch happening? Am I missing something here?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ipshita
>>>>>>>>>

Re: Query on clusterdumper output and clusteredPoints

Posted by Paritosh Ranjan <pr...@xebia.com>.

If you want to put the data into the local file system, I think you will 
have to read the data in the cluster directories (the output of the 
postprocessor) one by one and write it to your local system.

I am not sure what ClusterDumper does. If it also does the same thing 
(reads the cluster output and writes it to the local file system), then 
you can use it on all the directories produced by the postprocessor.
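
Reading the cluster directories one by one could look roughly like the sketch below. It uses only java.nio and assumes the postprocessor output has already been copied to the local disk, with one subdirectory per cluster containing part-* files (as in the output1512/pp/115 path mentioned earlier in this thread). On HDFS the same walk would go through Hadoop's FileSystem API instead. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: lists the part files found under each cluster directory.
public class ClusterDirWalker {

    // Returns a map of cluster directory name -> part files inside that directory.
    public static Map<String, List<Path>> partFilesByCluster(Path postProcessorOutput)
            throws IOException {
        Map<String, List<Path>> result = new TreeMap<>();
        try (DirectoryStream<Path> clusterDirs = Files.newDirectoryStream(postProcessorOutput)) {
            for (Path clusterDir : clusterDirs) {
                if (!Files.isDirectory(clusterDir)) {
                    continue; // skip stray files at the top level
                }
                List<Path> parts = new ArrayList<>();
                // The glob keeps only files whose names start with "part-".
                try (DirectoryStream<Path> files = Files.newDirectoryStream(clusterDir, "part-*")) {
                    for (Path p : files) {
                        parts.add(p);
                    }
                }
                result.put(clusterDir.getFileName().toString(), parts);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Default path is only an example taken from this thread.
        Path root = Paths.get(args.length > 0 ? args[0] : "output1512/pp");
        for (Map.Entry<String, List<Path>> e : partFilesByCluster(root).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": "
                    + e.getValue().size() + " part file(s)");
        }
    }
}
```

Each part file found this way can then be opened and its vectors written out, one cluster directory at a time, which keeps memory use bounded by a single cluster.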

On 15-12-2011 14:07, ipshita chatterji wrote:
> I have used ClusterOutputPostProcessorDriver. Now how do I read the
> output generated by postprocessor? Is there a tool for that too?
>
> On Thu, Dec 15, 2011 at 10:37 AM, Paritosh Ranjan<pr...@xebia.com>  wrote:
>> Some typo in previous mail. Please read :
>>
>> ...which will post process your clustering output and group vectors
>> belonging to different clusters in their respective directories...
>>
>>
>> On 15-12-2011 10:34, Paritosh Ranjan wrote:
>>> You don't need to write your own code for analyzing clustered points. You
>>> can use ClusterOutputPostProcessorDriver which will post process your
>>> clusters and group clusters belonging to different clusters in their
>>> respective directories. You won't get any OOM here.
>>>
>>> Example of using it is here
>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html.
>>>
>>> And I would advice to use the current 0.6-snapshot snapshot to do
>>> clustering as well as post processing it.
>>> Using 0.5 to use clustering and 0.6-snapshot to write code to post process
>>> might create problems.
>>>
>>> Paritosh
>>>
>>> On 15-12-2011 08:37, ipshita chatterji wrote:
>>>> Actually clustering was done using 0.5 version of mahout but I am
>>>> using the clusterterdumper code from current version of mahout present
>>>> in "trunk" to analyze the clusters. To make it run I renamed the final
>>>> cluster by appending "-final".
>>>> I got the OOM error even after increasing the mahout heapsize and
>>>> hence had written a code of my own to analyze the clusters by reading
>>>> "-clusteredPoints".
>>>>
>>>> Thu, Dec 15, 2011 at 2:58 AM, Gary Snider<ga...@gmail.com>
>>>>   wrote:
>>>>
>>>>> Ok.  See if you can get the --pointsDir working and post what you get.
>>>>>   Also for seqFileDir do you have a directory with the word 'final' in it?
>>>>>
>>>>> On Dec 14, 2011, at 12:37 PM, ipshita chatterji<si...@gmail.com>
>>>>>   wrote:
>>>>>
>>>>>> For clusterdumper I had following commandline:
>>>>>>
>>>>>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>>>>>> --output clusteranalyze.txt
>>>>>>
>>>>>> Have written a separate program to read clusteredOutput directory as
>>>>>> clusterdumper with "--pointsDir output/clusteredPoints " was giving
>>>>>> OOM exception.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Dec 14, 2011 at 10:06 PM, Gary Snider<ga...@gmail.com>
>>>>>>   wrote:
>>>>>>> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>>>>>>>
>>>>>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita
>>>>>>> chatterji<si...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am a newbie in Mahout and also have elementary knowledge of
>>>>>>>> clustering. I managed to cluster my data using meanshift and then ran
>>>>>>>> clusterdumper, I get following output:
>>>>>>>>
>>>>>>>> MSV-21{n=1 c=[1:0...........]
>>>>>>>>
>>>>>>>> So I asssume that the cluster above has converged and n=1 indicates
>>>>>>>> that there is only one point associated with the cluster above.
>>>>>>>>
>>>>>>>> Now I try to read the members of this cluster from "clusteredPoints"
>>>>>>>> directory. I see from the output that number of points belonging this
>>>>>>>> cluster is 173.
>>>>>>>>
>>>>>>>> Why is this mismatch happening? Am I missing something here?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ipshita
>>>>>>>>

Re: Query on clusterdumper output and clusteredPoints

Posted by ipshita chatterji <si...@gmail.com>.

I have used ClusterOutputPostProcessorDriver. Now how do I read the
output generated by the postprocessor? Is there a tool for that too?

On Thu, Dec 15, 2011 at 10:37 AM, Paritosh Ranjan <pr...@xebia.com> wrote:
> Some typo in previous mail. Please read :
>
> ...which will post process your clustering output and group vectors
> belonging to different clusters in their respective directories...
>
>
> On 15-12-2011 10:34, Paritosh Ranjan wrote:
>>
>> You don't need to write your own code for analyzing clustered points. You
>> can use ClusterOutputPostProcessorDriver which will post process your
>> clusters and group clusters belonging to different clusters in their
>> respective directories. You won't get any OOM here.
>>
>> Example of using it is here
>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html.
>>
>> And I would advice to use the current 0.6-snapshot snapshot to do
>> clustering as well as post processing it.
>> Using 0.5 to use clustering and 0.6-snapshot to write code to post process
>> might create problems.
>>
>> Paritosh
>>
>> On 15-12-2011 08:37, ipshita chatterji wrote:
>>>
>>> Actually clustering was done using 0.5 version of mahout but I am
>>> using the clusterterdumper code from current version of mahout present
>>> in "trunk" to analyze the clusters. To make it run I renamed the final
>>> cluster by appending "-final".
>>> I got the OOM error even after increasing the mahout heapsize and
>>> hence had written a code of my own to analyze the clusters by reading
>>> "-clusteredPoints".
>>>
>>> Thu, Dec 15, 2011 at 2:58 AM, Gary Snider<ga...@gmail.com>
>>>  wrote:
>>>
>>>> Ok.  See if you can get the --pointsDir working and post what you get.
>>>>  Also for seqFileDir do you have a directory with the word 'final' in it?
>>>>
>>>> On Dec 14, 2011, at 12:37 PM, ipshita chatterji<si...@gmail.com>
>>>>  wrote:
>>>>
>>>>> For clusterdumper I had following commandline:
>>>>>
>>>>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>>>>> --output clusteranalyze.txt
>>>>>
>>>>> Have written a separate program to read clusteredOutput directory as
>>>>> clusterdumper with "--pointsDir output/clusteredPoints " was giving
>>>>> OOM exception.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Dec 14, 2011 at 10:06 PM, Gary Snider<ga...@gmail.com>
>>>>>  wrote:
>>>>>>
>>>>>> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>>>>>>
>>>>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita
>>>>>> chatterji<si...@gmail.com>wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am a newbie in Mahout and also have elementary knowledge of
>>>>>>> clustering. I managed to cluster my data using meanshift and then ran
>>>>>>> clusterdumper, I get following output:
>>>>>>>
>>>>>>> MSV-21{n=1 c=[1:0...........]
>>>>>>>
>>>>>>> So I asssume that the cluster above has converged and n=1 indicates
>>>>>>> that there is only one point associated with the cluster above.
>>>>>>>
>>>>>>> Now I try to read the members of this cluster from "clusteredPoints"
>>>>>>> directory. I see from the output that number of points belonging this
>>>>>>> cluster is 173.
>>>>>>>
>>>>>>> Why is this mismatch happening? Am I missing something here?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ipshita
>>>>>>>

Re: Query on clusterdumper output and clusteredPoints

Posted by Paritosh Ranjan <pr...@xebia.com>.

There was a typo in my previous mail. Please read:

...which will post process your clustering output and group vectors 
belonging to different clusters in their respective directories...

On 15-12-2011 10:34, Paritosh Ranjan wrote:
> You don't need to write your own code for analyzing clustered points. 
> You can use ClusterOutputPostProcessorDriver which will post process 
> your clusters and group clusters belonging to different clusters in 
> their respective directories. You won't get any OOM here.
>
> Example of using it is here 
> https://cwiki.apache.org/MAHOUT/top-down-clustering.html.
>
> And I would advice to use the current 0.6-snapshot snapshot to do 
> clustering as well as post processing it.
> Using 0.5 to use clustering and 0.6-snapshot to write code to post 
> process might create problems.
>
> Paritosh
>
> On 15-12-2011 08:37, ipshita chatterji wrote:
>> Actually clustering was done using 0.5 version of mahout but I am
>> using the clusterterdumper code from current version of mahout present
>> in "trunk" to analyze the clusters. To make it run I renamed the final
>> cluster by appending "-final".
>> I got the OOM error even after increasing the mahout heapsize and
>> hence had written a code of my own to analyze the clusters by reading
>> "-clusteredPoints".
>>
>> Thu, Dec 15, 2011 at 2:58 AM, Gary Snider<ga...@gmail.com>  
>> wrote:
>>
>>> Ok.  See if you can get the --pointsDir working and post what you 
>>> get.  Also for seqFileDir do you have a directory with the word 
>>> 'final' in it?
>>>
>>> On Dec 14, 2011, at 12:37 PM, ipshita 
>>> chatterji<si...@gmail.com>  wrote:
>>>
>>>> For clusterdumper I had following commandline:
>>>>
>>>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>>>> --output clusteranalyze.txt
>>>>
>>>> Have written a separate program to read clusteredOutput directory as
>>>> clusterdumper with "--pointsDir output/clusteredPoints " was giving
>>>> OOM exception.
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Dec 14, 2011 at 10:06 PM, Gary 
>>>> Snider<ga...@gmail.com>  wrote:
>>>>> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>>>>>
>>>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita 
>>>>> chatterji<si...@gmail.com>wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am a newbie in Mahout and also have elementary knowledge of
>>>>>> clustering. I managed to cluster my data using meanshift and then 
>>>>>> ran
>>>>>> clusterdumper, I get following output:
>>>>>>
>>>>>> MSV-21{n=1 c=[1:0...........]
>>>>>>
>>>>>> So I asssume that the cluster above has converged and n=1 indicates
>>>>>> that there is only one point associated with the cluster above.
>>>>>>
>>>>>> Now I try to read the members of this cluster from "clusteredPoints"
>>>>>> directory. I see from the output that number of points belonging 
>>>>>> this
>>>>>> cluster is 173.
>>>>>>
>>>>>> Why is this mismatch happening? Am I missing something here?
>>>>>>
>>>>>> Thanks,
>>>>>> Ipshita
>>>>>>

Re: Query on clusterdumper output and clusteredPoints

Posted by Paritosh Ranjan <pr...@xebia.com>.

You don't need to write your own code for analyzing clustered points. 
You can use ClusterOutputPostProcessorDriver which will post process 
your clusters and group clusters belonging to different clusters in 
their respective directories. You won't get any OOM here.

Example of using it is here 
https://cwiki.apache.org/MAHOUT/top-down-clustering.html.

And I would advise using the current 0.6-snapshot for clustering as 
well as for post-processing.
Using 0.5 for clustering and 0.6-snapshot to write the post-processing 
code might create problems.

Paritosh

On 15-12-2011 08:37, ipshita chatterji wrote:
> Actually clustering was done using 0.5 version of mahout but I am
> using the clusterterdumper code from current version of mahout present
> in "trunk" to analyze the clusters. To make it run I renamed the final
> cluster by appending "-final".
> I got the OOM error even after increasing the mahout heapsize and
> hence had written a code of my own to analyze the clusters by reading
> "-clusteredPoints".
>
> Thu, Dec 15, 2011 at 2:58 AM, Gary Snider<ga...@gmail.com>  wrote:
>
>> Ok.  See if you can get the --pointsDir working and post what you get.  Also for seqFileDir do you have a directory with the word 'final' in it?
>>
>> On Dec 14, 2011, at 12:37 PM, ipshita chatterji<si...@gmail.com>  wrote:
>>
>>> For clusterdumper I had following commandline:
>>>
>>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>>> --output clusteranalyze.txt
>>>
>>> Have written a separate program to read clusteredOutput directory as
>>> clusterdumper with "--pointsDir output/clusteredPoints " was giving
>>> OOM exception.
>>>
>>> Thanks
>>>
>>> On Wed, Dec 14, 2011 at 10:06 PM, Gary Snider<ga...@gmail.com>  wrote:
>>>> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>>>>
>>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita chatterji<si...@gmail.com>wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am a newbie in Mahout and also have elementary knowledge of
>>>>> clustering. I managed to cluster my data using meanshift and then ran
>>>>> clusterdumper, I get following output:
>>>>>
>>>>> MSV-21{n=1 c=[1:0...........]
>>>>>
>>>>> So I asssume that the cluster above has converged and n=1 indicates
>>>>> that there is only one point associated with the cluster above.
>>>>>
>>>>> Now I try to read the members of this cluster from "clusteredPoints"
>>>>> directory. I see from the output that number of points belonging this
>>>>> cluster is 173.
>>>>>
>>>>> Why is this mismatch happening? Am I missing something here?
>>>>>
>>>>> Thanks,
>>>>> Ipshita
>>>>>

Re: Query on clusterdumper output and clusteredPoints

Posted by ipshita chatterji <si...@gmail.com>.

Actually, the clustering was done using version 0.5 of Mahout, but I am
using the clusterdumper code from the current version of Mahout in
"trunk" to analyze the clusters. To make it run, I renamed the final
cluster directory by appending "-final".
I got the OOM error even after increasing the Mahout heap size, and
hence wrote code of my own to analyze the clusters by reading
"-clusteredPoints".

On Thu, Dec 15, 2011 at 2:58 AM, Gary Snider <ga...@gmail.com> wrote:
> Ok.  See if you can get the --pointsDir working and post what you get.  Also for seqFileDir do you have a directory with the word 'final' in it?
>
> On Dec 14, 2011, at 12:37 PM, ipshita chatterji <si...@gmail.com> wrote:
>
>> For clusterdumper I had following commandline:
>>
>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>> --output clusteranalyze.txt
>>
>> Have written a separate program to read clusteredOutput directory as
>> clusterdumper with "--pointsDir output/clusteredPoints " was giving
>> OOM exception.
>>
>> Thanks
>>
>> On Wed, Dec 14, 2011 at 10:06 PM, Gary Snider <ga...@gmail.com> wrote:
>>> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>>>
>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita chatterji <si...@gmail.com>wrote:
>>>
>>>> Hi,
>>>>
>>>> I am a newbie in Mahout and also have elementary knowledge of
>>>> clustering. I managed to cluster my data using meanshift and then ran
>>>> clusterdumper, I get following output:
>>>>
>>>> MSV-21{n=1 c=[1:0...........]
>>>>
>>>> So I asssume that the cluster above has converged and n=1 indicates
>>>> that there is only one point associated with the cluster above.
>>>>
>>>> Now I try to read the members of this cluster from "clusteredPoints"
>>>> directory. I see from the output that number of points belonging this
>>>> cluster is 173.
>>>>
>>>> Why is this mismatch happening? Am I missing something here?
>>>>
>>>> Thanks,
>>>> Ipshita
>>>>

Re: Query on clusterdumper output and clusteredPoints

Posted by Gary Snider <ga...@gmail.com>.

Ok.  See if you can get the --pointsDir working and post what you get.  Also for seqFileDir do you have a directory with the word 'final' in it?

On Dec 14, 2011, at 12:37 PM, ipshita chatterji <si...@gmail.com> wrote:

> For clusterdumper I had following commandline:
> 
> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
> --output clusteranalyze.txt
> 
> Have written a separate program to read clusteredOutput directory as
> clusterdumper with "--pointsDir output/clusteredPoints " was giving
> OOM exception.
> 
> Thanks
> 
> On Wed, Dec 14, 2011 at 10:06 PM, Gary Snider <ga...@gmail.com> wrote:
>> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>> 
>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita chatterji <si...@gmail.com>wrote:
>> 
>>> Hi,
>>> 
>>> I am a newbie in Mahout and also have elementary knowledge of
>>> clustering. I managed to cluster my data using meanshift and then ran
>>> clusterdumper, I get following output:
>>> 
>>> MSV-21{n=1 c=[1:0...........]
>>> 
>>> So I asssume that the cluster above has converged and n=1 indicates
>>> that there is only one point associated with the cluster above.
>>> 
>>> Now I try to read the members of this cluster from "clusteredPoints"
>>> directory. I see from the output that number of points belonging this
>>> cluster is 173.
>>> 
>>> Why is this mismatch happening? Am I missing something here?
>>> 
>>> Thanks,
>>> Ipshita
>>>

Re: Query on clusterdumper output and clusteredPoints

Posted by Suneel Marthi <su...@yahoo.com>.

Ensure that you increase the JVM memory settings when running the clusterdump program to avoid OOM.
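
To confirm that a larger heap setting actually reaches the JVM, a small check like the one below can help. It is a generic Java sketch, not Mahout code, and the class name is hypothetical. As far as I know the bin/mahout script reads a MAHOUT_HEAPSIZE environment variable (in MB), but treat that as an assumption and check your copy of the script.

```java
// Hypothetical helper: reports the maximum heap this JVM is allowed to use.
public class HeapCheck {

    // Maximum heap in megabytes, as reported by the running JVM.
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMb() + " MB");
    }
}
```

Running it with the same options as the failing job, for example "java -Xmx2g HeapCheck", shows whether the limit was picked up.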



________________________________
 From: ipshita chatterji <si...@gmail.com>
To: user@mahout.apache.org 
Sent: Wednesday, December 14, 2011 12:37 PM
Subject: Re: Query on clusterdumper output and clusteredPoints
 
For clusterdumper I had following commandline:

$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
--output clusteranalyze.txt

Have written a separate program to read clusteredOutput directory as
clusterdumper with "--pointsDir output/clusteredPoints " was giving
OOM exception.

Thanks

On Wed, Dec 14, 2011 at 10:06 PM, Gary Snider <ga...@gmail.com> wrote:
> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>
> On Wed, Dec 14, 2011 at 10:54 AM, ipshita chatterji <si...@gmail.com>wrote:
>
>> Hi,
>>
>> I am a newbie in Mahout and also have elementary knowledge of
>> clustering. I managed to cluster my data using meanshift and then ran
>> clusterdumper, I get following output:
>>
>> MSV-21{n=1 c=[1:0...........]
>>
>> So I asssume that the cluster above has converged and n=1 indicates
>> that there is only one point associated with the cluster above.
>>
>> Now I try to read the members of this cluster from "clusteredPoints"
>> directory. I see from the output that number of points belonging this
>> cluster is 173.
>>
>> Why is this mismatch happening? Am I missing something here?
>>
>> Thanks,
>> Ipshita
>>

Re: Query on clusterdumper output and clusteredPoints

Posted by ipshita chatterji <si...@gmail.com>.

For clusterdumper I had the following command line:

$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
--output clusteranalyze.txt

I have written a separate program to read the clusteredOutput directory,
as clusterdumper with "--pointsDir output/clusteredPoints" was giving
an OOM exception.

Thanks

On Wed, Dec 14, 2011 at 10:06 PM, Gary Snider <ga...@gmail.com> wrote:
> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>
> On Wed, Dec 14, 2011 at 10:54 AM, ipshita chatterji <si...@gmail.com>wrote:
>
>> Hi,
>>
>> I am a newbie in Mahout and also have elementary knowledge of
>> clustering. I managed to cluster my data using meanshift and then ran
>> clusterdumper, I get following output:
>>
>> MSV-21{n=1 c=[1:0...........]
>>
>> So I asssume that the cluster above has converged and n=1 indicates
>> that there is only one point associated with the cluster above.
>>
>> Now I try to read the members of this cluster from "clusteredPoints"
>> directory. I see from the output that number of points belonging this
>> cluster is 173.
>>
>> Why is this mismatch happening? Am I missing something here?
>>
>> Thanks,
>> Ipshita
>>

Re: Query on clusterdumper output and clusteredPoints

Posted by Gary Snider <ga...@gmail.com>.

What was on your command line?  e.g. seqFileDir, pointsDir, etc

On Wed, Dec 14, 2011 at 10:54 AM, ipshita chatterji <si...@gmail.com>wrote:

> Hi,
>
> I am a newbie in Mahout and also have elementary knowledge of
> clustering. I managed to cluster my data using meanshift and then ran
> clusterdumper, I get following output:
>
> MSV-21{n=1 c=[1:0...........]
>
> So I asssume that the cluster above has converged and n=1 indicates
> that there is only one point associated with the cluster above.
>
> Now I try to read the members of this cluster from "clusteredPoints"
> directory. I see from the output that number of points belonging this
> cluster is 173.
>
> Why is this mismatch happening? Am I missing something here?
>
> Thanks,
> Ipshita
>