Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/09/12 18:24:55 UTC

Need dev advice: SSVD - Clustering Pipeline

There appears to be a gap in the SSVD-->Clustering pipeline. It can be patched in a couple of ways, so could the devs please advise before we make a patch:

The Issues:
  * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
  * SSVD does not output NamedVectors even if they are input.

Solutions:
  1. We could modify clustering to output ID-Vector pairs in the file clusteredPoints/part-xxxx, where the IDs are the Keys of the original input vectors and the Vector is the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in its properties. This would require a change to the clustering classifier but no change to SSVD or the rest of clustering. It would affect anyone using clusteredPoints, since they would have to deal with a new output vector type (actually, wasn't this file using WeightedPropertyVectorWritable before the Mahout 0.7 refactoring?).
  2. We could alter SSVD to output NamedVectors, and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seem to be the only way to perform this mapping now, there would be very little impact on current users.
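
As a rough illustration of the two options, here is a sketch in plain Java. The classes below are stand-ins invented for this example, not the Mahout WeightedPropertyVectorWritable or NamedVector classes, and the "rowKey" property name and the truncating "reduction" are assumptions. Option 1 attaches the input row key to each emitted point; option 2 carries the row's name through the transformation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java stand-ins for illustration; none of these are Mahout classes.
class PipelineOptionsSketch {

    // Hypothetical analogue of a weighted point that can carry properties.
    static final class WeightedPropertyPoint {
        final double weight;
        final double[] values;
        final Map<String, String> properties = new LinkedHashMap<>();

        WeightedPropertyPoint(double weight, double[] values) {
            this.weight = weight;
            this.values = values;
        }
    }

    // Hypothetical analogue of a named vector: values plus a label.
    static final class NamedRow {
        final String name;
        final double[] values;

        NamedRow(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    // Option 1: the classifier copies the input row key into the point's
    // properties. "rowKey" is an assumed property name.
    static WeightedPropertyPoint attachRowKey(int rowKey, double weight, double[] values) {
        WeightedPropertyPoint point = new WeightedPropertyPoint(weight, values);
        point.properties.put("rowKey", Integer.toString(rowKey));
        return point;
    }

    // Option 2: a transformation (a stand-in for SSVD, here just truncation
    // to k components) returns a row that keeps the name of its input row.
    static NamedRow reduceKeepingName(NamedRow input, int k) {
        return new NamedRow(input.name, Arrays.copyOf(input.values, k));
    }

    public static void main(String[] args) {
        WeightedPropertyPoint p = attachRowKey(42, 1.0, new double[] {0.5, 0.25});
        System.out.println(p.properties);

        NamedRow reduced = reduceKeepingName(
            new NamedRow("http://example.com/doc1", new double[] {1.0, 2.0, 3.0}), 2);
        System.out.println(reduced.name + " " + Arrays.toString(reduced.values));
    }
}
```

Either way the point that reaches clusteredPoints still identifies its source row; the options differ only in which stage of the pipeline has to change.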

Afaict one of these has to be done and they are not mutually exclusive. Any advice?

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Oops, #1 should say:
1. We could modify clustering to output ID-Vector pairs in the file clusteredPoints/part-xxxx, where the IDs are Keys to clusters and the Vector is the original input VectorWritable, but with an ID attached. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in its properties. This would require a change to the clustering classifier but no change to SSVD or the rest of clustering. It would affect anyone using clusteredPoints, since they would have to deal with a new output vector type (actually, wasn't this file using WeightedPropertyVectorWritable before the Mahout 0.7 refactoring?).

On Sep 12, 2012, at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:

The Issues:
* There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
* SSVD does not output NamedVectors even if they are input.

Solutions:
1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.

Afaict one of these has to be done and they are not mutually exclusive. Any advice?

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Pat Ferrel <pa...@gmail.com>.

Ok, it all seems to work for U and USigma. NamedVectors go all the way through to the clusteredPoints. 

I'll do some work on the clusterdump kmeans + SSVD test and submit it back.

Thanks Dmitriy!

On Sep 12, 2012, at 11:26 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

Ok i committed first round at
https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1067

Could you perhaps test named vector propagation with it? I did not
write any unit tests for named vector propagation yet and i need to
run now.

Note that the API has changed to accommodate USigma, so you need to set
it in the API and use getUSigmaPath() after completion.

This issue is now tracked thru MAHOUT-1067.

-d

On Wed, Sep 12, 2012 at 9:55 AM, Pat Ferrel <pa...@gmail.com> wrote:
> This is my personally favored solution. I wish NamedVectors were used in RowSimilarity too, and may submit a patch for it. If you output NamedVectors then they would enable the RowSimilarity patch too.
> 
> If you want someone to do some ad hoc testing with real world data, I'm in. I'll follow your github.
> 
> On Sep 12, 2012, at 9:42 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> I will file and work on a patch for SSVD to propagate named vectors
> (if present). This is trivial. + USigma output. Will publish in a few
> in my github.
> 
> On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:
>> 
>> The Issues:
>> * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
>> * SSVD does not output NamedVectors even if they are input.
>> 
>> Solutions:
>> 1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
>> 2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.
>> 
>> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>> 
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Pat Ferrel <pa...@gmail.com>.

ok, need to refresh the trunk too so it may take a few.

On Sep 12, 2012, at 11:26 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

Ok i committed first round at
https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1067

Could you perhaps test named vector propagation with it? I did not
write any unit tests for named vector propagation yet and i need to
run now.

Note that the API has changed to accommodate USigma, so you need to set
it in the API and use getUSigmaPath() after completion.

This issue is now tracked thru MAHOUT-1067.

-d

On Wed, Sep 12, 2012 at 9:55 AM, Pat Ferrel <pa...@gmail.com> wrote:
> This is my personally favored solution. I wish NamedVectors were used in RowSimilarity too, and may submit a patch for it. If you output NamedVectors then they would enable the RowSimilarity patch too.
> 
> If you want someone to do some ad hoc testing with real world data, I'm in. I'll follow your github.
> 
> On Sep 12, 2012, at 9:42 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> I will file and work on a patch for SSVD to propagate named vectors
> (if present). This is trivial. + USigma output. Will publish in a few
> in my github.
> 
> On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:
>> 
>> The Issues:
>> * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
>> * SSVD does not output NamedVectors even if they are input.
>> 
>> Solutions:
>> 1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
>> 2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.
>> 
>> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>> 
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Ok i committed first round at
https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1067

Could you perhaps test named vector propagation with it? I did not
write any unit tests for named vector propagation yet and i need to
run now.

Note that the API has changed to accommodate USigma, so you need to set
it in the API and use getUSigmaPath() after completion.

This issue is now tracked thru MAHOUT-1067.

-d

On Wed, Sep 12, 2012 at 9:55 AM, Pat Ferrel <pa...@gmail.com> wrote:
> This is my personally favored solution. I wish NamedVectors were used in RowSimilarity too, and may submit a patch for it. If you output NamedVectors then they would enable the RowSimilarity patch too.
>
> If you want someone to do some ad hoc testing with real world data, I'm in. I'll follow your github.
>
> On Sep 12, 2012, at 9:42 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> I will file and work on a patch for SSVD to propagate named vectors
> (if present). This is trivial. + USigma output. Will publish in a few
> in my github.
>
> On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:
>>
>> The Issues:
>>  * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
>>  * SSVD does not output NamedVectors even if they are input.
>>
>> Solutions:
>>  1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
>>  2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.
>>
>> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>>
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

+1 my distinct favorite too

On 9/12/12 12:55 PM, Pat Ferrel wrote:
> This is my personally favored solution. I wish NamedVectors were used in RowSimilarity too, and may submit a patch for it. If you output NamedVectors then they would enable the RowSimilarity patch too.
>
> If you want someone to do some ad hoc testing with real world data, I'm in. I'll follow your github.
>
> On Sep 12, 2012, at 9:42 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> I will file and work on a patch for SSVD to propagate named vectors
> (if present). This is trivial. + USigma output. Will publish in a few
> in my github.
>
> On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:
>>
>> The Issues:
>>   * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
>>   * SSVD does not output NamedVectors even if they are input.
>>
>> Solutions:
>>   1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
>>   2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.
>>
>> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>>
>
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Pat Ferrel <pa...@gmail.com>.

This is my personally favored solution. I wish NamedVectors were used in RowSimilarity too, and may submit a patch for it. If you output NamedVectors then they would enable the RowSimilarity patch too. 

If you want someone to do some ad hoc testing with real world data, I'm in. I'll follow your github.

On Sep 12, 2012, at 9:42 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

I will file and work on a patch for SSVD to propagate named vectors
(if present). This is trivial. + USigma output. Will publish in a few
in my github.

On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:
> 
> The Issues:
>  * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
>  * SSVD does not output NamedVectors even if they are input.
> 
> Solutions:
>  1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
>  2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.
> 
> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

I will file and work on a patch for SSVD to propagate named vectors
(if present). This is trivial. + USigma output. Will publish in a few
in my github.

On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:
>
> The Issues:
>   * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
>   * SSVD does not output NamedVectors even if they are input.
>
> Solutions:
>   1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
>   2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.
>
> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Pat Ferrel <pa...@gmail.com>.

There ARE integer keys in the DRM, which is why I proposed #1 and asked for advice. Anyone using clustering for classification today has to use NamedVectors or write their own classifier. I think Paritosh verified this and it is rather obvious to clustering users. I'd think of the Names as additional data that has many benefits, not a replacement for the DRM contract about keys. I'd guess the Names are pretty much ignored by DRM calculations, though I'd hope they are preserved where present.

As for NamedVectors, I would like to see them used more widely when they are present. RowSimilarity preserves the Keys in its output file and throws away the Names even when NamedVectors are present (clustering throws away the IDs and keeps the Names). I will probably get around to submitting a patch for RowSimilarity so people can use NamedVectors in a larger part of their analysis. The patch would do away with the need for the huge index of ID/Key to Name created by RowID, which is used to produce the DRM input to RowSimilarity. So RowSimilarity consumes a DRM with NamedVectors but does not use the names in the non-DRM output.
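
To make the lookup-table point concrete, here is a minimal sketch in plain Java. The class and method names are hypothetical and nothing here is RowSimilarity or RowID code; it only shows the extra join that a separate key-to-name index implies. If names were carried in the output, this step would not be needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the RowSimilarity or RowID implementation.
class RowNameLookupSketch {

    // Without names in the output, each similar-row key must be resolved
    // through a separate key-to-name index (the role the RowID index plays).
    static List<String> resolveViaIndex(List<Integer> similarRowKeys,
                                        Map<Integer, String> keyToName) {
        List<String> names = new ArrayList<>();
        for (Integer key : similarRowKeys) {
            String name = keyToName.get(key);
            // A key missing from the index cannot be mapped back to a document.
            names.add(name != null ? name : "<unknown key " + key + ">");
        }
        return names;
    }

    public static void main(String[] args) {
        Map<Integer, String> index = new HashMap<>();
        index.put(0, "http://example.com/a");
        index.put(1, "http://example.com/b");

        List<Integer> similarToRow0 = new ArrayList<>();
        similarToRow0.add(1);
        similarToRow0.add(5);

        System.out.println(resolveViaIndex(similarToRow0, index));
    }
}
```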

Also if you have a scheme where Names are globally unique, like I do when using URLs for Names, the unique Names are quite useful for adding new data or comparing data between different runs of analysis. There are always other ways to do these things but I find the Names useful in several ways outside of clustering.

A slight digression here… In general I try to denormalize my data (in the database sense), which loosely means keeping associated data together even if it causes duplication. With huge databases you want to minimize the complexity and number of queries, and with nice foreign keys (DB index keys here) the resulting queries are about as fast as they can be. These can be computed when data is inserted into the DB, but having the URL associated with the vectors and used as a DB key makes things much simpler to deal with and faster. I mention this because the same principle could be applied to Hadoop/Mahout in some places: try to do away with lookup tables where convenient. At present the only one I think could be removed is the RowSimilarity one; the rest I have dealt with seem necessary.

So these are some of the reasons I like NamedVectors. Still, I would be OK with solution #1, which would have to be a fix to the Clustering classifier.

On Sep 12, 2012, at 8:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

Yeah, I see. I was under the impression that the general distributed matrix
contract was leaning towards identifying rows by sequence file keys.

The reasoning is that if rows were identified by named vector, the contract
would have required named vectors. But it doesn't.

On the other hand, a DRM always requires sequence file keys.

So relying on the named vector contract is not foolproof, as we have
discovered.
On Sep 12, 2012 8:55 PM, "Pat Ferrel" <pa...@gmail.com> wrote:

> To be clear this change only affects classification of the input vectors.
> Everything else in clustering works fine without it. I need to know which
> vectors are in which clusters, it is why I run clustering, for its
> classification function. There will be many who don't care about
> classification.
> 
> On Sep 12, 2012, at 8:27 PM, Pat Ferrel <pa...@gmail.com> wrote:
> 
> Yes, you have output but it is only partly useful.
> 
> There are two things created during clustering:
> Clusters, which are basically centroids and their vectors
> If you ask the driver to classify your input into clusters, you get
> clusteredPoints
> Both of these are created, even without NamedVectors. The clusters
> centroids are quite alright with non-NamedVectors as input. However though
> clusteredPoints is created there is no way to tell which vectors are
> classified by cluster since all you get is anonymous weights in the
> vectors. How can you tell which doc was in which cluster?
> 
> Creating a new classifier that would attach vector IDs when there is no
> NamedVector is my #2 solution below.
> 
> So yes, it still runs and produces clusters but in my application and I
> suspect quite a few others, the cluster is only of interest if the input is
> classified into the clusters.
> 
> On Sep 12, 2012, at 7:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> I am curious though.
> 
> do you really have no cluster output unless Named vectors are used?
> 
> It is strange because even if I did not use Named vectors, i would
> still expect for for clusters to form correctly, with the cluster ids
> and points and top terms. So cluster dumper should still produce
> document vectors (even if without original name) and top terms, i.e.
> clustered points should not be empty. After all, I am not obliged to
> follow text analysis pipeline as in the MIA, i might as well come up
> with my own DRM i would like to find clusters for; and i might not
> have used text labels in that matrix..
> 
> On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> There appears to be a gap in the pipeline SSVD-->Clustering. It can be
> patched in a couple ways so can the devs please advise before we make a
> patch:
>> 
>> The Issues:
>> * There is currently no output from clustering that maps input vectors
> to clusters, unless you input NamedVectors to clustering.
>> * SSVD does not output NamedVectors even if they are input.
>> 
>> Solutions:
>> 1. We could modify clustering to output in the file
> clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the
> original input vectors and the Vector would be the original input
> VectorWritable. This might be done by replacing the WeightedVectorWritable
> with a WeightedPropertyVectorWritable and putting the ID in properties.
> This would require a change in the clustering classifier but no change to
> SSVD or the rest of clustering. This would impact anyone using
> clusteredPoints since they would have to deal with a new output vector type
> (actually wasn't this file using WeightedPropertyVectorWritable before the
> mahout 0.7 refactoring?)
>> 2. We could alter SSVD to output NamedVectors and Clustering would
> simply pass them through without modification as it does today. This would
> require a change to SSVD but not to Clustering. Since NamedVectors seems to
> be the only way to perform this mapping now, there would be very little
> impact on current users.
>> 
>> Afaict one of these has to be done and they are not mutually exclusive.
> Any advice?
>> 
> 
> 
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Yeah, I see. I was under the impression that the general distributed matrix
contract was leaning towards identifying rows by sequence file keys.

The reasoning is that if rows were identified by named vector, the contract
would have required named vectors. But it doesn't.

On the other hand, a DRM always requires sequence file keys.

So relying on the named vector contract is not foolproof, as we have
discovered.
On Sep 12, 2012 8:55 PM, "Pat Ferrel" <pa...@gmail.com> wrote:

> To be clear this change only affects classification of the input vectors.
> Everything else in clustering works fine without it. I need to know which
> vectors are in which clusters, it is why I run clustering, for its
> classification function. There will be many who don't care about
> classification.
>
> On Sep 12, 2012, at 8:27 PM, Pat Ferrel <pa...@gmail.com> wrote:
>
> Yes, you have output but it is only partly useful.
>
> There are two things created during clustering:
> Clusters, which are basically centroids and their vectors
> If you ask the driver to classify your input into clusters, you get
> clusteredPoints
> Both of these are created, even without NamedVectors. The clusters
> centroids are quite alright with non-NamedVectors as input. However though
> clusteredPoints is created there is no way to tell which vectors are
> classified by cluster since all you get is anonymous weights in the
> vectors. How can you tell which doc was in which cluster?
>
> Creating a new classifier that would attach vector IDs when there is no
> NamedVector is my #2 solution below.
>
> So yes, it still runs and produces clusters but in my application and I
> suspect quite a few others, the cluster is only of interest if the input is
> classified into the clusters.
>
> On Sep 12, 2012, at 7:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> I am curious though.
>
> do you really have no cluster output unless Named vectors are used?
>
> It is strange because even if I did not use Named vectors, i would
> still expect for for clusters to form correctly, with the cluster ids
> and points and top terms. So cluster dumper should still produce
> document vectors (even if without original name) and top terms, i.e.
> clustered points should not be empty. After all, I am not obliged to
> follow text analysis pipeline as in the MIA, i might as well come up
> with my own DRM i would like to find clusters for; and i might not
> have used text labels in that matrix..
>
> On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> > There appears to be a gap in the pipeline SSVD-->Clustering. It can be
> patched in a couple ways so can the devs please advise before we make a
> patch:
> >
> > The Issues:
> >  * There is currently no output from clustering that maps input vectors
> to clusters, unless you input NamedVectors to clustering.
> >  * SSVD does not output NamedVectors even if they are input.
> >
> > Solutions:
> >  1. We could modify clustering to output in the file
> clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the
> original input vectors and the Vector would be the original input
> VectorWritable. This might be done by replacing the WeightedVectorWritable
> with a WeightedPropertyVectorWritable and putting the ID in properties.
> This would require a change in the clustering classifier but no change to
> SSVD or the rest of clustering. This would impact anyone using
> clusteredPoints since they would have to deal with a new output vector type
> (actually wasn't this file using WeightedPropertyVectorWritable before the
> mahout 0.7 refactoring?)
> >  2. We could alter SSVD to output NamedVectors and Clustering would
> simply pass them through without modification as it does today. This would
> require a change to SSVD but not to Clustering. Since NamedVectors seems to
> be the only way to perform this mapping now, there would be very little
> impact on current users.
> >
> > Afaict one of these has to be done and they are not mutually exclusive.
> Any advice?
> >
>
>
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Pat Ferrel <pa...@gmail.com>.

To be clear, this change only affects classification of the input vectors. Everything else in clustering works fine without it. I need to know which vectors are in which clusters; that is why I run clustering, for its classification function. There will be many who don't care about classification.

On Sep 12, 2012, at 8:27 PM, Pat Ferrel <pa...@gmail.com> wrote:

Yes, you have output but it is only partly useful.

There are two things created during clustering:
  * Clusters, which are basically centroids and their vectors.
  * clusteredPoints, which you get if you ask the driver to classify your input into clusters.

Both of these are created, even without NamedVectors. The cluster centroids are quite alright with non-NamedVectors as input. However, although clusteredPoints is created, there is no way to tell which vectors are classified into which cluster, since all you get is anonymous weighted vectors. How can you tell which doc was in which cluster?

Creating a new classifier that would attach vector IDs when there is no NamedVector is my #2 solution below.

So yes, it still runs and produces clusters but in my application and I suspect quite a few others, the cluster is only of interest if the input is classified into the clusters.

On Sep 12, 2012, at 7:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

I am curious though.

do you really have no cluster output unless Named vectors are used?

It is strange because even if I did not use Named vectors, I would
still expect clusters to form correctly, with the cluster ids
and points and top terms. So cluster dumper should still produce
document vectors (even if without original name) and top terms, i.e.
clustered points should not be empty. After all, I am not obliged to
follow text analysis pipeline as in the MIA, i might as well come up
with my own DRM i would like to find clusters for; and i might not
have used text labels in that matrix..

On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:
> 
> The Issues:
>  * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
>  * SSVD does not output NamedVectors even if they are input.
> 
> Solutions:
>  1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
>  2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.
> 
> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Pat Ferrel <pa...@gmail.com>.

Yes, you have output but it is only partly useful.

There are two things created during clustering:
  * Clusters, which are basically centroids and their vectors.
  * clusteredPoints, which you get if you ask the driver to classify your input into clusters.

Both of these are created, even without NamedVectors. The cluster centroids are fine with non-NamedVectors as input. However, although clusteredPoints is created, there is no way to tell which input vectors were assigned to which cluster, since all you get is anonymous weighted vectors. How can you tell which doc was in which cluster?
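
To make the gap concrete, here is a minimal sketch in plain Java. Everything in it is a hypothetical stand-in for illustration, not Mahout code: ClusteredPoint only loosely models one clusteredPoints record (cluster key, weight, vector values), and its id field is an assumption standing in for whatever would carry the name or ID, whether a NamedVector or a property on the output vector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in types only: none of these are Mahout classes.
public class ClusteredPointsSketch {

    /** Hypothetical model of one clusteredPoints record. */
    static final class ClusteredPoint {
        final int clusterId;    // key assigned by the classification step
        final double weight;    // membership weight of the point
        final double[] values;  // the vector's values
        final String id;        // null when the input vector carried no name or ID

        ClusteredPoint(int clusterId, double weight, double[] values, String id) {
            this.clusterId = clusterId;
            this.weight = weight;
            this.values = values;
            this.id = id;
        }
    }

    /** Groups point IDs by cluster; a point without an ID cannot be traced back to a doc. */
    static Map<Integer, List<String>> idsByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (ClusteredPoint p : points) {
            List<String> ids = result.get(p.clusterId);
            if (ids == null) {
                ids = new ArrayList<>();
                result.put(p.clusterId, ids);
            }
            ids.add(p.id == null ? "<anonymous>" : p.id);
        }
        return result;
    }

    public static void main(String[] args) {
        // What you get today without names: weights and values, but no way back to the doc.
        List<ClusteredPoint> anonymous = Arrays.asList(
            new ClusteredPoint(0, 1.0, new double[] {0.1, 0.9}, null),
            new ClusteredPoint(1, 1.0, new double[] {0.8, 0.2}, null));

        // What either proposed fix would give: the same records with an ID attached.
        List<ClusteredPoint> identified = Arrays.asList(
            new ClusteredPoint(0, 1.0, new double[] {0.1, 0.9}, "doc-17"),
            new ClusteredPoint(1, 1.0, new double[] {0.8, 0.2}, "doc-42"));

        System.out.println(idsByCluster(anonymous));   // {0=[<anonymous>], 1=[<anonymous>]}
        System.out.println(idsByCluster(identified));  // {0=[doc-17], 1=[doc-42]}
    }
}
```

In the first case the clusters form fine but the doc-to-cluster mapping is lost; in the second it can be read straight off the output. The sketch is indifferent to where the ID comes from, which is why the two proposed fixes are not mutually exclusive.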

Creating a new classifier that would attach vector IDs when there is no NamedVector is my #1 solution below.

So yes, it still runs and produces clusters, but in my application, and I suspect quite a few others, the clusters are only of interest if the input can be classified into them.

On Sep 12, 2012, at 7:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

I am curious though.

Do you really have no cluster output unless NamedVectors are used?

It is strange, because even if I did not use NamedVectors, I would
still expect clusters to form correctly, with the cluster IDs,
points, and top terms. So the cluster dumper should still produce
document vectors (even without the original names) and top terms, i.e.
clustered points should not be empty. After all, I am not obliged to
follow the text analysis pipeline as in the MIA; I might as well come up
with my own DRM that I would like to find clusters for, and I might not
have used text labels in that matrix.

On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:
> 
> The Issues:
>  * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
>  * SSVD does not output NamedVectors even if they are input.
> 
> Solutions:
>  1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
>  2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.
> 
> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>

Re: Need dev advice: SSVD - Clustering Pipeline

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

I am curious though.

Do you really have no cluster output unless NamedVectors are used?

It is strange, because even if I did not use NamedVectors, I would
still expect clusters to form correctly, with the cluster IDs,
points, and top terms. So the cluster dumper should still produce
document vectors (even without the original names) and top terms, i.e.
clustered points should not be empty. After all, I am not obliged to
follow the text analysis pipeline as in the MIA; I might as well come up
with my own DRM that I would like to find clusters for, and I might not
have used text labels in that matrix.

On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> There appears to be a gap in the pipeline SSVD-->Clustering. It can be patched in a couple ways so can the devs please advise before we make a patch:
>
> The Issues:
>   * There is currently no output from clustering that maps input vectors to clusters, unless you input NamedVectors to clustering.
>   * SSVD does not output NamedVectors even if they are input.
>
> Solutions:
>   1. We could modify clustering to output in the file clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original input vectors and the Vector would be the original input VectorWritable. This might be done by replacing the WeightedVectorWritable with a WeightedPropertyVectorWritable and putting the ID in properties. This would require a change in the clustering classifier but no change to SSVD or the rest of clustering. This would impact anyone using clusteredPoints since they would have to deal with a new output vector type (actually wasn't this file using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
>   2. We could alter SSVD to output NamedVectors and Clustering would simply pass them through without modification as it does today. This would require a change to SSVD but not to Clustering. Since NamedVectors seems to be the only way to perform this mapping now, there would be very little impact on current users.
>
> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>