You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@mahout.apache.org by Julian Limon <ju...@tukipa.com> on 2011/04/25 21:32:21 UTC

Top items in a vector

Hello all,

I'm using SVD to reduce the dimensionality of a text corpus. When I get
queries, I generate a new matrix with them (based on the dictionary of the
index) and apply the same matrix transformation. Finally, I
multiply (SVD'd) the index matrix by the (SVD'd) query matrix to get a
similarity vector for each query.

My question is, is there a class (or a command-line instruction) that
generates the top items from this vector? I know that Taste has a
abstraction called "TopItems", but I wonder if a similar thing exists for
vectors.

Thanks a lot,

Julian

Re: Top items in a vector

Posted by Ted Dunning <te...@gmail.com>.

Yeah... you keep wanting to solve fun problems.

The OP had a very simple problem.

On Mon, Apr 25, 2011 at 3:17 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> But then again i think i misunderstood the problem the second time...
>
> On Mon, Apr 25, 2011 at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> > oh never mind.
> >
> > that's trivial. As Ted mentioned, i perhaps by mistake assumed the
> > problem is to find most frequently used queries.
> >
> > if he just needs top N with the highest similarity score...well...
> > that's kind of a problem i am solving for LSI over hbase right now...
> > I don't want to disclose exactly how, or Ted will say that's not the
> > way :) But, there are definitely ways to organize the vector space
> > model to find N closest without scanning the entire vector set.
> >
>

Re: Top items in a vector

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

But then again i think i misunderstood the problem the second time...

On Mon, Apr 25, 2011 at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> oh never mind.
>
> that's trivial. As Ted mentioned, i perhaps by mistake assumed the
> problem is to find most frequently used queries.
>
> if he just needs top N with the highest similarity score...well...
> that's kind of a problem i am solving for LSI over hbase right now...
> I don't want to disclose exactly how, or Ted will say that's not the
> way :) But, there are definitely ways to organize the vector space
> model to find N closest without scanning the entire vector set.
>
> -d
>
>
>
> On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <ja...@gmail.com> wrote:
>> On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> if you "forget" some of the items you saw along with the counters,
>>> how'd you add them back to the heap?
>>>
>>
>> What do you mean?  This is the common search problem: you're iterating over
>> a list of int, double pairs (int is "docId", double is the "score"), and you
>> want to
>> keep the topK out of all N of them.  Keep a size-K heap of docId/score
>> pairs,
>> and as you iterate, check to see if your current one beats your current K'th
>> best
>> result - if not, you know it will *never* be in the overall topK, and it can
>> be discarded
>> (O(1) operation, this happens most of the time if K << n, and the list is
>> randomly ordered).  If it beats the bottom of the heap, insert it (O(log(K)
>> operation),
>> dropping the old bottom of the heap.
>>
>> This is O(n + k*log(k)) for a randomly sorted list.
>>
>>  -jake
>>
>

Re: Top items in a vector

Posted by Julian Limon <ju...@tukipa.com>.

2011/4/25 Dmitriy Lyubimov <dl...@gmail.com>

> if he just needs top N with the highest similarity score...well...
> that's kind of a problem i am solving for LSI over hbase right now...
> I don't want to disclose exactly how, or Ted will say that's not the
> way :) But, there are definitely ways to organize the vector space
> model to find N closest without scanning the entire vector set.
>
> -d
>
>
>
Hello Dmitry,

I'd love to know more about this non-disclosable solution! I'm basically
using Mahout's command-line instructions to model an LSI space (along with
some custom code to get the query to execute against the original
dictionary). My guess is that my current solution is too naive and does not
take advantage of the full capabilities of Mahout. If you want to disclose
some of the specifics, I'd love open a new thread to discuss my approach.

Thanks,

Julian

Re: Top items in a vector

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Sorry Ted,

I just want to reserve the right to do some of my own try-and-fail
experimentation and have fun in the process :)


On Mon, Apr 25, 2011 at 3:04 PM, Ted Dunning <te...@gmail.com> wrote:
> Ouch!
>
> On Mon, Apr 25, 2011 at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> I don't want to disclose exactly how, or Ted will say that's not the
>> way :)
>>
>

Re: Top items in a vector

Posted by Ted Dunning <te...@gmail.com>.

Ouch!

On Mon, Apr 25, 2011 at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I don't want to disclose exactly how, or Ted will say that's not the
> way :)
>

Re: Top items in a vector

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

oh never mind.

that's trivial. As Ted mentioned, i perhaps by mistake assumed the
problem is to find most frequently used queries.

if he just needs top N with the highest similarity score...well...
that's kind of a problem i am solving for LSI over hbase right now...
I don't want to disclose exactly how, or Ted will say that's not the
way :) But, there are definitely ways to organize the vector space
model to find N closest without scanning the entire vector set.

-d

On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> if you "forget" some of the items you saw along with the counters,
>> how'd you add them back to the heap?
>>
>
> What do you mean?  This is the common search problem: you're iterating over
> a list of int, double pairs (int is "docId", double is the "score"), and you
> want to
> keep the topK out of all N of them.  Keep a size-K heap of docId/score
> pairs,
> and as you iterate, check to see if your current one beats your current K'th
> best
> result - if not, you know it will *never* be in the overall topK, and it can
> be discarded
> (O(1) operation, this happens most of the time if K << n, and the list is
> randomly ordered).  If it beats the bottom of the heap, insert it (O(log(K)
> operation),
> dropping the old bottom of the heap.
>
> This is O(n + k*log(k)) for a randomly sorted list.
>
>  -jake
>

Re: Top items in a vector

Posted by Ted Dunning <te...@gmail.com>.

On Mon, Apr 25, 2011 at 3:01 PM, Julian Limon <ju...@tukipa.com>wrote:

>  I will implement
> this using the algorithms that were suggested here. I also think it would
> be
> great if there was a more general class that could be instantiated with a
> vector, but for my specific case it's probably not needed.
>

Don't implement the actual algorithms.  Use an existing priority queue.
 Even just a
java collections version based on Vector.Entry would be fine.

> Again, thanks everyone. I'm surprised of the level of engagement of this
> community. This is definitely a huge plus to our decision to go with
> Mahout.
>

Aw... shucks!

Thanks.  Pass the word!

Re: Top items in a vector

Posted by Julian Limon <ju...@tukipa.com>.

Thanks a lot, everyone!

This is really helpful. I only expect to get the top k items out of an
n-sized vector. The heap should be small enough to fit in memory and (for my
case) it might be an overkill to try to parallelize it. I will implement
this using the algorithms that were suggested here. I also think it would be
great if there was a more general class that could be instantiated with a
vector, but for my specific case it's probably not needed.

Again, thanks everyone. I'm surprised of the level of engagement of this
community. This is definitely a huge plus to our decision to go with Mahout.

Thanks,

Julian

2011/4/25 Ted Dunning <te...@gmail.com>

> The simplest implementation is O(n log(k)) (inserts every element, then
> drops at most one element).
>
> Caching the k-th best element and only inserting new items has the same
> worst case behavior in ordered cases.
>
> There is a true O(n + k) algorithm based on quicksort (I think).
>
> On Mon, Apr 25, 2011 at 2:26 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> > >
> > >
> > > This is O(n + k*log(k)) for a randomly sorted list.
> > >
> >
> > This is a bit sloppy: there's a multiplicative factor of log(n) also in
> the
> > second
> > additive factor, but that's still linear overall.
> >
> >  -jake
> >
>

Re: Top items in a vector

Posted by Jake Mannix <ja...@gmail.com>.

On Mon, Apr 25, 2011 at 4:48 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Mon, Apr 25, 2011 at 4:26 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> > BUT
> >
> > As Jake says, this is all really just O(n) and is so fast that none of
> this
> > matters to the user.
> >
>
> Unless you have 1ms alotted for this... in which case you really have
> no choice but to stop early and and hope to get really best, perhaps
> using some clever heuristical structures that may help to improve
> results such as inverted indices...
>

Yes, in the real search case, you hopefully have a static prior which allows
your list to be "semi-sorted" so that you can early-terminate, otherwise eve
linear isn't really scalable for the live query use case (even with
parallelism thrown into the mix).

  -jake

Re: Top items in a vector

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Mon, Apr 25, 2011 at 4:26 PM, Ted Dunning <te...@gmail.com> wrote:
>
> BUT
>
> As Jake says, this is all really just O(n) and is so fast that none of this
> matters to the user.
>

Unless you have 1ms alotted for this... in which case you really have
no choice but to stop early and and hope to get really best, perhaps
using some clever heuristical structures that may help to improve
results such as inverted indices...

i remember scanning a 50-million memory mapped inverted index some
years ago and i remember i couldn't get more than 500k results looked
thru in that scan in less than 40 ms... Which was kind of more than i
had...  with this very heap-accumulating technique (and i did not even
use PriorityQueue cause it really did not have -- or i haven't found
-- the operation of replacing the heap+sift thru instead of
removing+inserting... which did manage to make a small dent in
efficiency.

Re: Top items in a vector

Posted by Jake Mannix <ja...@gmail.com>.

On Mon, Apr 25, 2011 at 4:26 PM, Ted Dunning <te...@gmail.com> wrote:
>
> Pedantically speaking, that is.  If we talking worst case bounds, then the
> worst case comes into the conversation though it be of little practical
> import.


If we're already talking about probabilistic algorithms, talking about worst
case starts to get a little silly, IMO.

Indeed.  I have never implemented anything else.  Note that the worst case
> is still linear in n.  It just has the full k log k multiplier.  Normally
> the multiplier is much less but it seems (after a full 10 seconds of
> reflection) that you still have a total of O(n + log n log k) if you cache
>

I think that's O(n + log(n) * k * log(k)), because you have to re-heapify,
but yeah, same effect:

BUT
>
> As Jake says, this is all really just O(n) and is so fast that none of this
> matters to the user.
>

Re: Top items in a vector

Posted by Ted Dunning <te...@gmail.com>.

On Mon, Apr 25, 2011 at 4:15 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Mon, Apr 25, 2011 at 2:45 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > The simplest implementation is O(n log(k)) (inserts every element, then
> > drops at most one element).
> >
> > Caching the k-th best element and only inserting new items has the same
> > worst case behavior in ordered cases.
> >
>
> why on earth would the current situation at hand ever have a fully ordered
> vector?
>

Pedantically speaking, that is.  If we talking worst case bounds, then the
worst case comes into the conversation though it be of little practical
import.

> This is a search problem, and the simple search solution works almost
> completely linearly.
>

Indeed.  I have never implemented anything else.  Note that the worst case
is still linear in n.  It just has the full k log k multiplier.  Normally
the multiplier is much less but it seems (after a full 10 seconds of
reflection) that you still have a total of O(n + log n log k) if you cache
the best element.  I say this because you get O(n) by simply examining each
element and then the probability of the j-th element seen making it into the
heap is k / j when j >= k.  This means that the total cost goes like log k
(sum_{j=1..n} k / j) = O(log n log k)

BUT

As Jake says, this is all really just O(n) and is so fast that none of this
matters to the user.

Re: Top items in a vector

Posted by Jake Mannix <ja...@gmail.com>.

On Mon, Apr 25, 2011 at 2:45 PM, Ted Dunning <te...@gmail.com> wrote:

> The simplest implementation is O(n log(k)) (inserts every element, then
> drops at most one element).
>
> Caching the k-th best element and only inserting new items has the same
> worst case behavior in ordered cases.
>

why on earth would the current situation at hand ever have a fully ordered
vector?

This is a search problem, and the simple search solution works almost
completely linearly.

  -jake

Re: Top items in a vector

Posted by Ted Dunning <te...@gmail.com>.

The simplest implementation is O(n log(k)) (inserts every element, then
drops at most one element).

Caching the k-th best element and only inserting new items has the same
worst case behavior in ordered cases.

There is a true O(n + k) algorithm based on quicksort (I think).

On Mon, Apr 25, 2011 at 2:26 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> >
> >
> > This is O(n + k*log(k)) for a randomly sorted list.
> >
>
> This is a bit sloppy: there's a multiplicative factor of log(n) also in the
> second
> additive factor, but that's still linear overall.
>
>  -jake
>

Re: Top items in a vector

Posted by Jake Mannix <ja...@gmail.com>.

On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <ja...@gmail.com> wrote:
>
>
> This is O(n + k*log(k)) for a randomly sorted list.
>

This is a bit sloppy: there's a multiplicative factor of log(n) also in the
second
additive factor, but that's still linear overall.

 -jake

Re: Top items in a vector

Posted by Jake Mannix <ja...@gmail.com>.

On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> if you "forget" some of the items you saw along with the counters,
> how'd you add them back to the heap?
>

What do you mean?  This is the common search problem: you're iterating over
a list of int, double pairs (int is "docId", double is the "score"), and you
want to
keep the topK out of all N of them.  Keep a size-K heap of docId/score
pairs,
and as you iterate, check to see if your current one beats your current K'th
best
result - if not, you know it will *never* be in the overall topK, and it can
be discarded
(O(1) operation, this happens most of the time if K << n, and the list is
randomly ordered).  If it beats the bottom of the heap, insert it (O(log(K)
operation),
dropping the old bottom of the heap.

This is O(n + k*log(k)) for a randomly sorted list.

  -jake

Re: Top items in a vector

Posted by Jake Mannix <ja...@gmail.com>.

On Mon, Apr 25, 2011 at 2:33 PM, Ted Dunning <te...@gmail.com> wrote:

> There are two meanings of top floating around in this thread.
>
> I think that the OP was asking for "find the indexes of the largest
> elements
> of a vector".  That is definitely what Sean answered.
>

That's definitely what I was talking about as well.

  -jake


>
> I think that Dmitriy mean "find the most common elements of a collection".
>  That is definitely harder and needs slight cleverness.
>
> On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > if you "forget" some of the items you saw along with the counters,
> > how'd you add them back to the heap?
> >
> >
> > On Mon, Apr 25, 2011 at 1:08 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> > > On Mon, Apr 25, 2011 at 1:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> > >
> > >> There's nothing in Mahout to do this (afaik).
> > >>
> > >> There's a statistical method for this doing that in linear time, which
> > >> is very easy to implement (and btw would make an excellent addition to
> > >> Mahout), google up 'countsketch'.
> > >>
> > >
> > > Taking top k items in a list of length n, when n >> k, is already O(n),
> > if
> > > you check the bottom of heap first before inserting.  Not sure if
> > > countsketch
> > > is necessary for the usual use case.
> > >
> > >  -jake
> > >
> >
>

Re: Top items in a vector

Posted by Ted Dunning <te...@gmail.com>.

There are two meanings of top floating around in this thread.

I think that the OP was asking for "find the indexes of the largest elements
of a vector".  That is definitely what Sean answered.

I think that Dmitriy mean "find the most common elements of a collection".
 That is definitely harder and needs slight cleverness.

On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> if you "forget" some of the items you saw along with the counters,
> how'd you add them back to the heap?
>
>
> On Mon, Apr 25, 2011 at 1:08 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > On Mon, Apr 25, 2011 at 1:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >
> >> There's nothing in Mahout to do this (afaik).
> >>
> >> There's a statistical method for this doing that in linear time, which
> >> is very easy to implement (and btw would make an excellent addition to
> >> Mahout), google up 'countsketch'.
> >>
> >
> > Taking top k items in a list of length n, when n >> k, is already O(n),
> if
> > you check the bottom of heap first before inserting.  Not sure if
> > countsketch
> > is necessary for the usual use case.
> >
> >  -jake
> >
>

Re: Top items in a vector

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

assuming of course that you don't have enough space to put all
distinct items into the heap?

On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> if you "forget" some of the items you saw along with the counters,
> how'd you add them back to the heap?
>
>
> On Mon, Apr 25, 2011 at 1:08 PM, Jake Mannix <ja...@gmail.com> wrote:
>> On Mon, Apr 25, 2011 at 1:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> There's nothing in Mahout to do this (afaik).
>>>
>>> There's a statistical method for this doing that in linear time, which
>>> is very easy to implement (and btw would make an excellent addition to
>>> Mahout), google up 'countsketch'.
>>>
>>
>> Taking top k items in a list of length n, when n >> k, is already O(n), if
>> you check the bottom of heap first before inserting.  Not sure if
>> countsketch
>> is necessary for the usual use case.
>>
>>  -jake
>>
>

Re: Top items in a vector

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

if you "forget" some of the items you saw along with the counters,
how'd you add them back to the heap?


On Mon, Apr 25, 2011 at 1:08 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Mon, Apr 25, 2011 at 1:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> There's nothing in Mahout to do this (afaik).
>>
>> There's a statistical method for this doing that in linear time, which
>> is very easy to implement (and btw would make an excellent addition to
>> Mahout), google up 'countsketch'.
>>
>
> Taking top k items in a list of length n, when n >> k, is already O(n), if
> you check the bottom of heap first before inserting.  Not sure if
> countsketch
> is necessary for the usual use case.
>
>  -jake
>

Re: Top items in a vector

Posted by Jake Mannix <ja...@gmail.com>.

On Mon, Apr 25, 2011 at 1:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> There's nothing in Mahout to do this (afaik).
>
> There's a statistical method for this doing that in linear time, which
> is very easy to implement (and btw would make an excellent addition to
> Mahout), google up 'countsketch'.
>

Taking top k items in a list of length n, when n >> k, is already O(n), if
you check the bottom of heap first before inserting.  Not sure if
countsketch
is necessary for the usual use case.

  -jake

Re: Top items in a vector

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

There's nothing in Mahout to do this (afaik).

There's a statistical method for this doing that in linear time, which
is very easy to implement (and btw would make an excellent addition to
Mahout), google up 'countsketch'.

It is an offline algorithm and although i never considered it, it
might be embarassingly parallelizable for MR.

-d

On Mon, Apr 25, 2011 at 12:32 PM, Julian Limon <ju...@tukipa.com> wrote:
> Hello all,
>
> I'm using SVD to reduce the dimensionality of a text corpus. When I get
> queries, I generate a new matrix with them (based on the dictionary of the
> index) and apply the same matrix transformation. Finally, I
> multiply (SVD'd) the index matrix by the (SVD'd) query matrix to get a
> similarity vector for each query.
>
> My question is, is there a class (or a command-line instruction) that
> generates the top items from this vector? I know that Taste has a
> abstraction called "TopItems", but I wonder if a similar thing exists for
> vectors.
>
> Thanks a lot,
>
> Julian
>

Re: Top items in a vector

Posted by Sean Owen <sr...@gmail.com>.

Yeah picking the top-n things comes up repeatedly. There's one
straightforward and efficient way to do it with a heap but it's the same 15
lines of code every time and deserves refactoring.

In fact in my own Mahout-based side project I have a more general "TopN"
class that does exactly this because it comes up so much. Hmm, maybe I
should stick it in Mahout and rewire things to use this more general class.

In the meantime, adapting the logic is just a little copy-and-paste work.

On Mon, Apr 25, 2011 at 8:32 PM, Julian Limon <ju...@tukipa.com>wrote:

> Hello all,
>
> I'm using SVD to reduce the dimensionality of a text corpus. When I get
> queries, I generate a new matrix with them (based on the dictionary of the
> index) and apply the same matrix transformation. Finally, I
> multiply (SVD'd) the index matrix by the (SVD'd) query matrix to get a
> similarity vector for each query.
>
> My question is, is there a class (or a command-line instruction) that
> generates the top items from this vector? I know that Taste has a
> abstraction called "TopItems", but I wonder if a similar thing exists for
> vectors.
>
> Thanks a lot,
>
> Julian
>