Posted to user@mahout.apache.org by Mike Perry <mi...@gmail.com> on 2010/11/20 17:47:22 UTC

Sparse Vectors

Hello all,

Does the script that converts a Lucene index to Mahout vectors write sequence
files in sparse vector representation? My impression is that it doesn't, but
I want to verify that.

Also, SparseVectorsFromSequenceFiles is used to convert the vectors to
sparse format (I know about the seq2sparse option). Could someone point out
where in the code it actually constructs the sparse vectors? It seems to me
that one of the methods in DictionaryVectorizer generates the vectors, but I
couldn't find where exactly.

Many thanks guys!
Mike

Re: Sparse Vectors

Posted by Mike Perry <mi...@gmail.com>.
Excellent. Thanks!

On Sun, Nov 21, 2010 at 2:22 PM, Drew Farris <dr...@apache.org> wrote:

> Per o.a.m.utils.vectors.lucene.TFDFMapper, which is called from
> o.a.m.utils.vectors.lucene.Driver, the vectors created are instances
> of RandomAccessSparseVector

Re: Sparse Vectors

Posted by Drew Farris <dr...@apache.org>.
Per o.a.m.utils.vectors.lucene.TFDFMapper, which is called from
o.a.m.utils.vectors.lucene.Driver, the vectors created are instances
of RandomAccessSparseVector, i.e. a sparse representation.
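
For anyone unsure what "sparse" buys you here, below is a rough sketch of the
idea: only the non-zero entries are stored, keyed by their index, and every
other position reads as 0.0. This is NOT Mahout's code. The class
SparseVectorSketch and its methods are made-up names for illustration only,
and the java.util.HashMap backing is an assumption chosen to keep the example
self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a random-access sparse vector.
// Not part of Mahout; names and storage choice are assumptions.
final class SparseVectorSketch {
    private final int cardinality;                          // logical number of dimensions
    private final Map<Integer, Double> values = new HashMap<>(); // non-zero entries only

    SparseVectorSketch(int cardinality) {
        if (cardinality < 0) {
            throw new IllegalArgumentException("cardinality must be >= 0");
        }
        this.cardinality = cardinality;
    }

    // Store a value; writing 0.0 removes the entry so storage stays sparse.
    void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            values.remove(index);
        } else {
            values.put(index, value);
        }
    }

    // Read a value; positions that were never set read as 0.0.
    double get(int index) {
        checkIndex(index);
        Double v = values.get(index);
        return v == null ? 0.0 : v;
    }

    // Logical size of the vector, independent of how many entries are stored.
    int size() {
        return cardinality;
    }

    // Number of entries actually held in memory.
    int numNonZero() {
        return values.size();
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    public static void main(String[] args) {
        // A term vector over a 100,000-term dictionary with two terms present.
        SparseVectorSketch doc = new SparseVectorSketch(100_000);
        doc.set(17, 0.42);
        doc.set(90_210, 1.3);
        System.out.println(doc.size() + " dims, " + doc.numNonZero() + " stored");
    }
}
```

For text data this matters because a document typically contains only a tiny
fraction of the dictionary's terms, so storing just the non-zero weights
keeps the vectors small.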

On Sun, Nov 21, 2010 at 9:28 AM, Mike Perry <mi...@gmail.com> wrote:
> Thanks Ted for the answer.
>
> "Should be sparse, but I can't say for sure."
>
> Could anybody confirm? In the quickstart-kmeans.sh script there's a step to
> convert the data to SequenceFile format (seqdirectory) and then
> a second step to convert the SequenceFiles to sparse vector format
> (seq2sparse). That's why I'm asking.

Re: Sparse Vectors

Posted by Mike Perry <mi...@gmail.com>.
Thanks Ted for the answer.

"Should be sparse, but I can't say for sure."

Could anybody confirm? In the quickstart-kmeans.sh script there's a step to
convert the data to SequenceFile format (seqdirectory) and then
a second step to convert the SequenceFiles to sparse vector format
(seq2sparse). That's why I'm asking.


On Sat, Nov 20, 2010 at 3:45 PM, Ted Dunning <te...@gmail.com> wrote:

> On Sat, Nov 20, 2010 at 8:47 AM, Mike Perry <mikeperrycanada@gmail.com> wrote:
>
> > Hello all,
> >
> > Does the script to convert a Lucene index to Mahout vectors write sequence
> > files in sparse vector representation? my impression is that it doesn't but
> > I want to verify that.
> >
>
> Should be sparse, but I can't say for sure.
>
>
> > Also, SparseVectorsFromSequenceFiles is used to convert the vectors to
> > sparse format (I know about the seq2sparse option). Could someone point out
> > where in the code it actually constructs the sparse vectors?  it seems to me
> > that one of the methods in DictionaryVectorizer generates the vectors but I
> > couldn't find where exactly.
> >
>
> Look for VectorWritable.
>

Re: Sparse Vectors

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Nov 20, 2010 at 8:47 AM, Mike Perry <mi...@gmail.com> wrote:

> Hello all,
>
> Does the script to convert a Lucene index to Mahout vectors write sequence
> files in sparse vector representation? my impression is that it doesn't but
> I want to verify that.
>

Should be sparse, but I can't say for sure.


> Also, SparseVectorsFromSequenceFiles is used to convert the vectors to
> sparse format (I know about the seq2sparse option). Could someone point out
> where in the code it actually constructs the sparse vectors?  it seems to me
> that one of the methods in DictionaryVectorizer generates the vectors but I
> couldn't find where exactly.
>

Look for VectorWritable.