Posted to user@mahout.apache.org by Colum Foley <co...@gmail.com> on 2013/03/06 10:58:58 UTC

Elephant-Bird NamedVector Format

Hi,

I am using the Elephant-Bird package to write feature vectors to
sequence files for use in clustering with Mahout. The problem I have
is that when I run K-Means clustering, I cannot decipher which item
has been assigned to which cluster. I asked a question on this forum
yesterday and the suggestion was to use NamedVectors to store the name
of the vector so that I can read this back in when reviewing the
results.

When I print out the contents of the clusteredPoints/part-m-00000
file, I see the following:

1.0: [9684:1.000] is in 1205002
1.0: [713:1.000, 1022:1.000, 5514:1.000, 9098:1.000] is in 1205002
1.0: [5414:1.000] is in 1205002
1.0: [5158:2.000] is in 1205002
1.0: [424:1.000] is in 1205002
1.0: [7460:3.000] is in 1205002

But I need a way to resolve the vector names.

My question is: can Elephant-Bird store vectors in NamedVector format,
where the key is used as the name?

If not, can anyone suggest another way to resolve the names of items
that have been assigned to clusters in the absence of NamedVectors?


Many thanks in advance,
Colum

Re: Elephant-Bird NamedVector Format

Posted by Andy Schlaikjer <an...@gmail.com>.
Could you perhaps load the results back into Pig and join the unnamed data
with a relation which contains the names?
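
A rough sketch of that join, written in Python rather than Pig purely to
illustrate the logic. It assumes the printed lines look exactly like the
sample in the original message, and that the join has to be done on the
vector contents, since the printed output shows only the vector and the
cluster id. The names table (`named_vectors`) and the item names are
hypothetical; in practice they would come from wherever the original
key/vector pairs were written.

```python
import re
from collections import defaultdict

# Matches one printed line of clusteredPoints output, e.g.
#   "1.0: [713:1.000, 1022:1.000] is in 1205002"
# The format is assumed from the sample in the original message.
LINE_RE = re.compile(r"^\s*([\d.]+):\s*\[(.*?)\]\s+is in\s+(\S+)\s*$")


def parse_entries(text):
    """Turn "713:1.000, 1022:1.000" into a sorted tuple of (index, value)."""
    entries = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        index, value = part.split(":")
        entries.append((int(index), float(value)))
    return tuple(sorted(entries))


def parse_line(line):
    """Return (weight, vector_key, cluster_id), or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    weight, body, cluster_id = match.groups()
    return float(weight), parse_entries(body), cluster_id


def join_names(dump_lines, named_vectors):
    """Join printed points to names, using the vector contents as the join key.

    named_vectors is a hypothetical table mapping a name to its sparse
    vector as {index: value}. Returns {cluster_id: [names per point]}.
    """
    names_by_vector = defaultdict(list)
    for name, vector in named_vectors.items():
        names_by_vector[tuple(sorted(vector.items()))].append(name)

    clusters = defaultdict(list)
    for line in dump_lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that are not in the assumed format
        _, key, cluster_id = parsed
        # Empty list: no name matched. Several names: identical vectors.
        clusters[cluster_id].append(names_by_vector.get(key, []))
    return dict(clusters)


if __name__ == "__main__":
    dump = [
        "1.0: [9684:1.000] is in 1205002",
        "1.0: [5158:2.000] is in 1205002",
    ]
    # Hypothetical item names, for illustration only.
    names = {"item-a": {9684: 1.0}, "item-b": {5158: 2.0}}
    print(join_names(dump, names))
```

Two caveats with joining on contents: items with identical vectors cannot
be told apart (each point gets every matching name), and the printed
values are rounded to three decimals, so non-integer weights may fail to
match exactly unless both sides are rounded the same way.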
