Posted to user@mahout.apache.org by Bikash Gupta <bi...@gmail.com> on 2014/02/19 12:14:31 UTC

Cluster Dumper in 0.9

Hi,

After running the cluster dumper on Kmeans output, I am getting only
the key of the sequence file.

The options provided to the cluster dumper are:

-i <<cluster-*-final of Kmeans>> -o <<Output File>>  -p
<<clusteredPoint>> -of CSV

Is there something that I am missing?

Note: I am using sequential mode.

-- 
Regards
Bikash Gupta

Re: Cluster Dumper in 0.9

Posted by Suneel Marthi <su...@yahoo.com>.
In the separate post that you are alluding to, it was also discussed that you should upgrade to 0.9, which fixes that issue. Running seqdumper on the clustered output should give the weight of each vector and the distance of each vector from the cluster centroid.

Did you try running seqdumper on the clustered output?






On Sunday, February 23, 2014 10:32 PM, Bikash Gupta <bi...@gmail.com> wrote:
 
Thanks, that makes sense.

Now, in a separate post we discussed that "The Clustered output should display the vectors with the vectorid that belong to a specific cluster along with the distance of that vector from the cluster center."

So, based on the above code, we are losing a few things for named vectors:

1. The weight of the vector, as it only prints the vector name
2. The distance of that vector from the cluster center

Would it be a good idea to modify the above code?




-- 
Thanks & Regards
Bikash Kumar Gupta 

Re: Cluster Dumper in 0.9

Posted by Bikash Gupta <bi...@gmail.com>.
Thanks, that makes sense.

Now, in a separate post we discussed that

*"The Clustered output should display the vectors with the vectorid that
belong to a specific cluster along with the distance of that vector from the
cluster center."*

So, based on the above code, we are losing a few things for named vectors:

1. The weight of the vector, as it only prints the vector name
2. The distance of that vector from the cluster center

Would it be a good idea to modify the above code?
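
As a rough illustration of what such a modification might print, here is a stand-alone sketch. It does not use Mahout classes: the Point class, the formatLine method and the name:weight:distance layout are all hypothetical stand-ins chosen for this example, not the actual CSVClusterWriter code or a proposed patch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Stand-alone sketch; nothing here is a Mahout class. */
public class WeightedCsvLineSketch {

  /** Hypothetical holder for what one clustered point would carry. */
  static final class Point {
    final String name;      // name of the named vector
    final double weight;    // weight of the point in its cluster
    final double distance;  // distance of the point from the cluster center

    Point(String name, double weight, double distance) {
      this.name = name;
      this.weight = weight;
      this.distance = distance;
    }
  }

  /** Builds "clusterId,name:weight:distance,..." for one cluster. */
  static String formatLine(int clusterId, List<Point> points) {
    StringBuilder line = new StringBuilder();
    line.append(clusterId);
    for (Point p : points) {
      line.append(',')
          .append(p.name).append(':')
          // Locale.ROOT keeps '.' as the decimal separator on every machine.
          .append(String.format(Locale.ROOT, "%.3f", p.weight)).append(':')
          .append(String.format(Locale.ROOT, "%.3f", p.distance));
    }
    return line.toString();
  }

  public static void main(String[] args) {
    List<Point> points = Arrays.asList(
        new Point("doc1", 1.0, 0.25),
        new Point("doc2", 1.0, 1.5));
    // prints: 7,doc1:1.000:0.250,doc2:1.000:1.500
    System.out.println(formatLine(7, points));
  }
}
```

The ':' separator is only an assumption for the sketch; it would be ambiguous if vector names themselves contained ':'.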


On Mon, Feb 24, 2014 at 6:05 AM, Suneel Marthi <su...@yahoo.com> wrote:

> The key in the CSV is the clusterId (and not the named vector).
>
> Here's the complete code snippet which should make sense.
>
> {Code}
>
>     Cluster cluster = clusterWritable.getValue();
>     line.append(cluster.getId());
>     List<WeightedPropertyVectorWritable> points =
>         getClusterIdToPoints().get(cluster.getId());
>     if (points != null) {
>       for (WeightedPropertyVectorWritable point : points) {
>         Vector theVec = point.getVector();
>         line.append(',');
>
>         if (theVec instanceof NamedVector) {
>           line.append(((NamedVector)theVec).getName());
>         } else {
>           String vecStr = theVec.asFormatString();
>           //do some basic manipulations for display
>           vecStr = VEC_PATTERN.matcher(vecStr).replaceAll("_");
>           line.append(vecStr);
>         }
>       }
>       getWriter().append(line).append("\n");
>     }
>
>
> {Code}
>
> For each clusterId it prints the names of the Named Vectors in the cluster
> or the vector itself (if not a named vector).
> Hope that clarifies.


-- 
Thanks & Regards
Bikash Kumar Gupta

Re: Cluster Dumper in 0.9

Posted by Suneel Marthi <su...@yahoo.com>.
The key in the CSV is the clusterId (and not the named vector).

Here's the complete code snippet which should make sense.

{Code}

    Cluster cluster = clusterWritable.getValue();
    line.append(cluster.getId());
    List<WeightedPropertyVectorWritable> points = getClusterIdToPoints().get(cluster.getId());
    if (points != null) {
      for (WeightedPropertyVectorWritable point : points) {
        Vector theVec = point.getVector();
        line.append(',');
        if (theVec instanceof NamedVector) {
          line.append(((NamedVector)theVec).getName());
        } else {
          String vecStr = theVec.asFormatString();
          //do some basic manipulations for display
          vecStr = VEC_PATTERN.matcher(vecStr).replaceAll("_");
          line.append(vecStr);
        }
      }
      getWriter().append(line).append("\n");
    }


{Code}

For each clusterId it prints the names of the Named Vectors in the cluster or the vector itself (if not a named vector).
Hope that clarifies.
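
To make the resulting line shape concrete, here is a stand-alone sketch of the same formatting logic. It does not depend on Mahout: Point is a hypothetical stand-in for a clustered point, and the VEC_PATTERN shown is an assumed pattern (braces, colons and commas), since the snippet above does not include its definition.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Stand-alone sketch of the CSV line layout; not the Mahout source. */
public class CsvClusterLineSketch {

  // Assumption: matches the characters of a vector format string that
  // would clash with CSV. The real definition is not shown in this thread.
  static final Pattern VEC_PATTERN = Pattern.compile("[\\{\\}:,]");

  /** Hypothetical stand-in: a null name means "not a named vector". */
  static final class Point {
    final String name;
    final String formatString;

    Point(String name, String formatString) {
      this.name = name;
      this.formatString = formatString;
    }
  }

  /** Builds one line: the cluster id, then one field per point. */
  static String formatLine(int clusterId, List<Point> points) {
    StringBuilder line = new StringBuilder().append(clusterId);
    for (Point p : points) {
      line.append(',');
      if (p.name != null) {
        // Named vector: only the name is written.
        line.append(p.name);
      } else {
        // Unnamed vector: the vector itself, with clashing characters replaced.
        line.append(VEC_PATTERN.matcher(p.formatString).replaceAll("_"));
      }
    }
    return line.toString();
  }

  public static void main(String[] args) {
    List<Point> points = Arrays.asList(
        new Point("docA", "{0:1.0}"),
        new Point(null, "{0:1.0,1:2.0}"));
    // prints: 3,docA,_0_1.0_1_2.0_
    System.out.println(formatLine(3, points));
  }
}
```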







On Friday, February 21, 2014 2:13 AM, Bikash Gupta <bi...@gmail.com> wrote:
 
Suneel,

I was going through the code of CSVClusterWriter and found that if the vector
is an instance of NamedVector, then it writes only the key.

if (theVec instanceof NamedVector) {
  line.append(((NamedVector)theVec).getName());
} else {
  String vecStr = theVec.asFormatString();
  //do some basic manipulations for display
  vecStr = VEC_PATTERN.matcher(vecStr).replaceAll("_");
  line.append(vecStr);
}

Hence I am getting only the key as the output of the cluster dumper. Could you
explain the design assumption behind this?




-- 
Thanks & Regards
Bikash Kumar Gupta

Re: Cluster Dumper in 0.9

Posted by Bikash Gupta <bi...@gmail.com>.
Suneel,

I was going through the code of CSVClusterWriter and found that if the vector
is an instance of NamedVector, then it writes only the key.

if (theVec instanceof NamedVector) {
  line.append(((NamedVector)theVec).getName());
} else {
  String vecStr = theVec.asFormatString();
  //do some basic manipulations for display
  vecStr = VEC_PATTERN.matcher(vecStr).replaceAll("_");
  line.append(vecStr);
}

Hence I am getting only the key as the output of the cluster dumper. Could you
explain the design assumption behind this?

On Wed, Feb 19, 2014 at 10:36 PM, Bikash Gupta <bi...@gmail.com> wrote:
> I am running cluster dumper
>
> After extracting output from Cluster dump I am transposing the row to
> column, hence I have directly called this class from my java code.
>
> Code:
>
> ClusterDumper.main(new String[] {
>                 buildOption(DefaultOptionCreator.INPUT_OPTION),seqFileDir,
>                 buildOption(DefaultOptionCreator.OUTPUT_OPTION),outputFile,
>                 buildOption(ClusterDumper.OUTPUT_FORMAT_OPT),format,
>                 buildOption(ClusterDumper.POINTS_DIR_OPTION),pointsDir
>                 });
>
> I have attached the output too. Please note that the key of the sequence file is
> Text.class and it is separated using the "`" character. I have also attached
> the cluster metadata.



-- 
Thanks & Regards
Bikash Kumar Gupta

Re: Cluster Dumper in 0.9

Posted by Bikash Gupta <bi...@gmail.com>.
I am running the cluster dumper.

After extracting the output from the cluster dump I transpose the rows to
columns, hence I have called this class directly from my Java code.

Code:

ClusterDumper.main(new String[] {
                buildOption(DefaultOptionCreator.INPUT_OPTION),seqFileDir,
                buildOption(DefaultOptionCreator.OUTPUT_OPTION),outputFile,
                buildOption(ClusterDumper.OUTPUT_FORMAT_OPT),format,
                buildOption(ClusterDumper.POINTS_DIR_OPTION),pointsDir
                });
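
The buildOption helper is not shown in this thread. A minimal sketch, assuming it simply prefixes the option name with "--", might look like the following. The literal option names are assumptions about the values of the constants used above and should be checked against your Mahout version; the paths in main are placeholders.

```java
import java.util.Arrays;

/** Stand-alone sketch of assembling the argument array; no Mahout dependency. */
public class ClusterDumperArgsSketch {

  /** Hypothetical helper: turns an option name into a long command-line flag. */
  static String buildOption(String name) {
    return "--" + name;
  }

  /** Builds the flag/value pairs in the same order as the call above. */
  static String[] buildArgs(String seqFileDir, String outputFile,
                            String format, String pointsDir) {
    return new String[] {
        buildOption("input"), seqFileDir,        // assumed value of INPUT_OPTION
        buildOption("output"), outputFile,       // assumed value of OUTPUT_OPTION
        buildOption("outputFormat"), format,     // assumed value of OUTPUT_FORMAT_OPT
        buildOption("pointsDir"), pointsDir      // assumed value of POINTS_DIR_OPTION
    };
  }

  public static void main(String[] args) {
    // Placeholder paths, for illustration only.
    String[] dumperArgs = buildArgs("clusters-final", "dump.csv", "CSV", "clusteredPoints");
    System.out.println(Arrays.toString(dumperArgs));
  }
}
```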

I have attached the output too. Please note that the key of the sequence file is
Text.class and it is separated using the "`" character. I have also attached
the cluster metadata.
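
For readers wondering what the transposition step could look like, here is a small stand-alone sketch. It assumes "transposing" means turning one CSV line of the form clusterId,point1,point2,... into one row per point; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone sketch: one dumped CSV line becomes one row per point. */
public class TransposeClusterCsvSketch {

  /** Turns "clusterId,p1,p2" into ["clusterId,p1", "clusterId,p2"]. */
  static List<String> transpose(String csvLine) {
    String[] fields = csvLine.split(",");
    List<String> rows = new ArrayList<>();
    // Field 0 is the cluster id; every later field is one point.
    for (int i = 1; i < fields.length; i++) {
      rows.add(fields[0] + "," + fields[i]);
    }
    return rows;
  }

  public static void main(String[] args) {
    // prints: 7,doc1 and then 7,doc2 on separate lines
    for (String row : transpose("7,doc1,doc2")) {
      System.out.println(row);
    }
  }
}
```

Splitting on ',' only works if point names contain no commas, which is an assumption of this sketch.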




On Wed, Feb 19, 2014 at 9:21 PM, Suneel Marthi <su...@yahoo.com> wrote:
> Are you running clusterdump or seqdumper?
>
> Could you paste the commands that you ran and their respective outputs?



-- 
Thanks & Regards
Bikash Kumar Gupta

Re: Cluster Dumper in 0.9

Posted by Suneel Marthi <su...@yahoo.com>.
Are you running clusterdump or seqdumper?

Could you paste the commands that you ran and their respective outputs?







On Wednesday, February 19, 2014 6:16 AM, Bikash Gupta <bi...@gmail.com> wrote:
 
Hi,

After running the cluster dumper on Kmeans output, I am getting only
the key of the sequence file.

The options provided to the cluster dumper are:

-i <<cluster-*-final of Kmeans>> -o <<Output File>>  -p
<<clusteredPoint>> -of CSV

Is there something that I am missing?

Note: I am using sequential mode.

-- 
Regards
Bikash Gupta