Posted to user@mahout.apache.org by Sampath Jayarathna <uk...@gmail.com> on 2012/08/15 18:50:42 UTC
Creating Mahout vectors from existing vectors
I already have (term, weight) data on which I want to run an LDA
analysis to find the topic distribution.
How should I create Mahout vectors from this?
The documentation says I can use VectorWriter, but I'm not sure how to go
about this.
Converting existing vectors to Mahout's format
>
> If you are in the happy position to already own a document (as in: texts,
> images or whatever item you wish to treat) processing pipeline, the
> question arises of how to convert the vectors into the Mahout vector
> format. Probably the easiest way to go would be to implement your own
> Iterable<Vector> (called VectorIterable in the example below) and then
> reuse the existing VectorWriter classes:
>
> VectorWriter vectorWriter = SequenceFile.createWriter(filesystem,
> configuration, outfile, LongWritable.class, SparseVector.class);
>
> long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
>
Thanks
-Sam
Re: Creating Mahout vectors from existing vectors
Posted by Sampath Jayarathna <uk...@gmail.com>.
Hi,
I'm trying to create a Mahout vector representation from my own
term-frequency values so that I can use LDA. My data is in the format
(term, frequency). I understand that I should read my (term, frequency)
pairs and then create Mahout vectors using the existing VectorWriter class.
I found the following code segment suggesting how to do so:
VectorWriter vectorWriter = SequenceFile.createWriter(filesystem,
configuration, outfile, LongWritable.class, SparseVector.class);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
However, I cannot find VectorWriter in org.apache.mahout.utils.vectors.io.
I used SequenceFile.Writer instead, but it seems to me that I'm just
creating a plain sequence file, not the Mahout vector format.
My code segment is below. It would be great if you could point me to how to
do this.
Path path = new Path("/home/hadoop/LDA/LDAHome/LDAData/output");
Configuration conf = new Configuration();
FileSystem fs;
SequenceFile.Writer writer = null;
BufferedReader buffer;
try {
    buffer = new BufferedReader(new
            FileReader("/home/hadoop/LDA/LDAHome/LDAData/test"));
    String line = null;
    org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
    org.apache.hadoop.io.Text value = new
            org.apache.hadoop.io.Text();
    try
    {
        try {
            fs = FileSystem.get(conf);
            writer = SequenceFile.createWriter(fs, conf, path,
                    key.getClass(), value.getClass());
            String[] temp;
            while ((line = buffer.readLine()) != null)
            {
                temp = line.split(" ");
                key.set(temp[0]);
                value.set(temp[1]);
                writer.append(key, value);
            }
............................
I really appreciate your help.
Thanks
Sam
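For readers following the thread: a Mahout vector stores (integer index, weight) entries, not (term, weight) strings, so one step the code above is missing is a dictionary that maps each term to an integer index. The sketch below shows only that step, using nothing but the Java standard library. The class name TermVectorSketch and the method toSparseEntries are hypothetical names made up for illustration and are not part of Mahout; the input is assumed to be whitespace-separated "term weight" lines, as in the code above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: turns "term weight" lines for one
// document into (index, weight) entries suitable for a sparse vector.
class TermVectorSketch {

    // Assigns each unseen term the next free integer index and sums the
    // weights of repeated terms. The dictionary is updated in place so that
    // the same term gets the same index across documents.
    static SortedMap<Integer, Double> toSparseEntries(List<String> lines,
                                                      Map<String, Integer> dictionary) {
        SortedMap<Integer, Double> entries = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length < 2) {
                continue; // ignore lines without a weight (assumed malformed)
            }
            Integer index = dictionary.get(parts[0]);
            if (index == null) {
                index = dictionary.size();
                dictionary.put(parts[0], index);
            }
            double weight = Double.parseDouble(parts[1]);
            entries.merge(index, weight, Double::sum);
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<String> document = Arrays.asList("apple 3", "banana 1.5", "apple 2");
        SortedMap<Integer, Double> entries = toSparseEntries(document, dictionary);
        System.out.println("dictionary: " + dictionary);
        System.out.println("entries:    " + entries);
    }
}
```

Each (index, weight) entry would then be set on a Mahout Vector (for example a RandomAccessSparseVector whose cardinality is the dictionary size), wrapped in a VectorWritable, and appended to the SequenceFile in place of the Text value. As far as I know, the LDA job also expects integer document ids as keys, so please check that against the Mahout version you are running.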