Posted to user@mahout.apache.org by Sampath Jayarathna <uk...@gmail.com> on 2012/08/15 18:50:42 UTC

Creating Mahout vectors from existing vectors

I already have (term, weight) data on which I want to run an LDA
analysis to find the topic distribution.

How should I create Mahout vectors from this?
The documentation says I can use VectorWriter, but I'm not sure how to go
about this.


> Converting existing vectors to Mahout's format
>
> If you are in the happy position to already own a document (as in: texts,
> images or whatever item you wish to treat) processing pipeline, the
> question arises of how to convert the vectors into the Mahout vector
> format. Probably the easiest way to go would be to implement your own
> Iterable<Vector> (called VectorIterable in the example below) and then
> reuse the existing VectorWriter classes:
>
> VectorWriter vectorWriter = SequenceFile.createWriter(filesystem,
>     configuration, outfile, LongWritable.class, SparseVector.class);
> long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);



Thanks

-Sam

Re: Creating Mahout vectors from existing vectors

Posted by Sampath Jayarathna <uk...@gmail.com>.
Hi,

I'm trying to create a Mahout vector representation from my own
term-frequency values so I can use LDA. My data is in the format
(term, frequency). I understand I should read my (term, frequency)
pairs and then create Mahout vectors using the existing VectorWriter class.
I found the following code segment suggesting how to do so:

VectorWriter vectorWriter = SequenceFile.createWriter(filesystem,
    configuration, outfile, LongWritable.class, SparseVector.class);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);

However, I cannot find VectorWriter in org.apache.mahout.utils.vectors.io.
I used SequenceFile.Writer instead, but it seems to me I'm just creating a
sequence file, not the Mahout vector format.
My code segment is below; it would be great if you could point me to how to
do this.

        Path path = new Path("/home/hadoop/LDA/LDAHome/LDAData/output");
        Configuration conf = new Configuration();
        FileSystem fs;
        SequenceFile.Writer writer = null;
        BufferedReader buffer;
        try {
            buffer = new BufferedReader(
                new FileReader("/home/hadoop/LDA/LDAHome/LDAData/test"));
            String line = null;
            org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
            org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
            try
            {

                try {
                    fs = FileSystem.get(conf);
                    writer = SequenceFile.createWriter(fs, conf, path,
                        key.getClass(), value.getClass());
                    String[] temp;
                    while((line = buffer.readLine()) != null)
                    {    temp = line.split(" ");
                        key.set(temp[0]);
                        value.set(temp[1]);
                        writer.append(key, value);
                    }
               ............................


I really appreciate your help.
Thanks

Sam
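
Editor's note: the code above writes Text/Text pairs, so each record is a (term, frequency) string pair rather than a vector. Mahout vectors are indexed by integer position, so one step that is needed in any case is mapping each term to an integer index and collecting the weights of one document into a sparse index -> weight structure. The following is a minimal sketch of that step only, using just the Java standard library. The class name TermVectorSketch, the Dictionary helper, and the whitespace-separated "term weight" line format are assumptions made for illustration (the format matches the split(" ") call in the code above); none of this is Mahout API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of Mahout or Hadoop.
public class TermVectorSketch {

    /** Assigns each distinct term a stable integer index, in order of first appearance. */
    static final class Dictionary {
        private final Map<String, Integer> indexByTerm = new LinkedHashMap<>();

        int indexOf(String term) {
            Integer idx = indexByTerm.get(term);
            if (idx == null) {
                idx = indexByTerm.size();
                indexByTerm.put(term, idx);
            }
            return idx;
        }

        /** Number of distinct terms seen so far; usable as the vector cardinality. */
        int size() {
            return indexByTerm.size();
        }
    }

    /**
     * Parses the "term weight" lines of ONE document into a sparse
     * index -> weight map. Weights of a repeated term are summed.
     */
    static Map<Integer, Double> toSparseVector(List<String> lines, Dictionary dictionary) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'term weight': " + line);
            }
            int index = dictionary.indexOf(parts[0]);
            double weight = Double.parseDouble(parts[1]);
            sparse.merge(index, weight, Double::sum);
        }
        return sparse;
    }

    public static void main(String[] args) {
        Dictionary dictionary = new Dictionary();
        List<String> doc = new ArrayList<>();
        doc.add("apple 2");
        doc.add("banana 1.5");
        doc.add("apple 1");
        Map<Integer, Double> vector = toSparseVector(doc, dictionary);
        System.out.println(vector + " cardinality=" + dictionary.size());
    }
}
```

From here, one plausible route (an assumption to check against the Mahout version in use) is to copy each (index, weight) entry into an org.apache.mahout.math.RandomAccessSparseVector, wrap it in a VectorWritable, and append that as the value of the SequenceFile instead of a Text. The key type expected by the LDA job differs between Mahout releases, so it is worth verifying before writing the file.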


