Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2015/05/06 18:29:21 UTC

Re: Streaming and incremental cooccurrence

100GB of RAM is fairly commonplace now. Recently I’ve seen many indicators and item metadata stored alongside the cooccurrence data and indexed. This produces extremely flexible results since the query determines the result, not the model. But it does increase the number of cooccurrences linearly with the number of indicator types.

As to a DB, any suggestions? It would need to have a very high-performance, memory-cached implementation. I wonder if the search engine itself would work; that would at least reduce the number of subsystems to deal with.
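
As a rough sketch of the search-engine-as-store idea (Scala building an Elasticsearch-style query; the index layout, field names, and item IDs here are assumptions, not a settled design): each item becomes one document with a field per indicator type holding the IDs of its correlated items, and a recommendation is just a query over those fields with the user's recent history, so the query shapes the result.

    // Sketch only: index name, field names, and item IDs are made up for illustration.
    def quoted(ids: Seq[String]): String = ids.map(id => "\"" + id + "\"").mkString(", ")

    val purchased = Seq("ipad", "iphone")   // user's recent purchases
    val viewed    = Seq("nexus", "galaxy")  // user's recent views

    val query =
      s"""{
         |  "query": {
         |    "bool": {
         |      "should": [
         |        { "terms": { "purchase": [ ${quoted(purchased)} ] } },
         |        { "terms": { "view":     [ ${quoted(viewed)} ] } }
         |      ]
         |    }
         |  },
         |  "size": 10
         |}""".stripMargin

    // POSTed to something like <search-host>/items/_search (or the Solr equivalent).
    println(query)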

On Apr 24, 2015, at 4:13 PM, Ted Dunning <te...@gmail.com> wrote:

Sounds about right.

My guess is that memory is now large enough, especially on a cluster, that
the cooccurrence data will fit into memory quite often.  Taking a large example
of 10 million items and 10,000 cooccurrences each, there will be 100
billion cooccurrences to store, which shouldn't take more than about half a
TB of data if fully populated.  This isn't that outrageous any more.  With
SSDs as backing store, even 100GB of RAM or less might well produce very
nice results.  Depending on incoming transaction rates, using spinning disk
as a backing store might also work with small memory.
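
A quick back-of-the-envelope check of that half-a-TB figure (the bytes-per-entry number below is an assumption about a compact packed layout, not a measurement):

    val items             = 10000000L                   // 10 million items
    val cooccurrencesEach = 10000L                       // ~10,000 retained per item
    val entries           = items * cooccurrencesEach    // 100 billion entries

    val bytesPerEntry = 5L                                // packed int id plus small count/weight
    val totalBytes    = entries * bytesPerEntry
    println(f"${totalBytes / 1e12}%.1f TB")               // ~0.5 TB, as estimated above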

Experiments are in order.



On Fri, Apr 24, 2015 at 8:12 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Ok, seems right.
> 
> So now to data structures. The input frequency vectors need to be paired
> with each input interaction type and would be nice to have as something
> that can be copied very fast as they get updated. Random access would also
> be nice but iteration is not needed. Over time they will get larger as all
> items get interactions, users will get more actions and appear in more
> vectors (with multi-interaction data). Seems like hashmaps?
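
A minimal sketch of what those per-interaction-type frequency vectors might look like (Scala with plain collections; the interaction-type names are placeholders and nothing here is a settled design):

    import scala.collection.mutable

    // One frequency vector per interaction type, keyed by internal integer IDs.
    final class InteractionCounts {
      // id -> number of interactions of this type; random access, iteration not required
      private val counts = mutable.HashMap.empty[Int, Long].withDefaultValue(0L)

      def increment(id: Int): Unit = counts(id) = counts(id) + 1L
      def apply(id: Int): Long     = counts(id)

      // A cheap consistent snapshot that can be handed off while updates keep arriving.
      def snapshot(): Map[Int, Long] = counts.toMap
    }

    // "purchase" and "view" are placeholder interaction types.
    val frequencies = Map("purchase" -> new InteractionCounts, "view" -> new InteractionCounts)
    frequencies("purchase").increment(42)   // item 42 got another purchase
    frequencies("view").increment(42)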
> 
> The cooccurrence matrix is more of a question to me. It needs to be
> updatable at the row and column level, and random access for both row and
> column would be nice. It needs to be expandable. To keep it small the keys
> should be integers, not full blown ID strings. There will have to be one
> matrix per interaction type. It should be simple to update the Search
> Engine to either mirror the matrix or use it directly for index updates.
> Each indicator update should cause an index update.
> 
> Putting aside speed and size issues, this sounds like a NoSQL DB table that
> is cached in-memory.
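
A minimal sketch of that in-memory structure (Scala; integer keys, expandable row and column access, one instance per interaction type, with the search-engine index update stubbed out as a callback; none of this is a settled Mahout design):

    import scala.collection.mutable

    // Row-major, expandable cooccurrence matrix with integer keys.
    // onRowChanged stands in for pushing the updated indicator row to the search engine.
    final class CooccurrenceMatrix(onRowChanged: (Int, Map[Int, Double]) => Unit) {
      private val rows = mutable.HashMap.empty[Int, mutable.HashMap[Int, Double]]
      // Reverse index so column-level access is also cheap: colId -> row ids containing it.
      private val cols = mutable.HashMap.empty[Int, mutable.Set[Int]]

      def update(rowId: Int, colId: Int, weight: Double): Unit = {
        val row = rows.getOrElseUpdate(rowId, mutable.HashMap.empty[Int, Double])
        row(colId) = weight
        cols.getOrElseUpdate(colId, mutable.Set.empty[Int]) += rowId
        onRowChanged(rowId, row.toMap)   // each indicator update triggers an index update
      }

      def row(rowId: Int): Map[Int, Double] = rows.get(rowId).map(_.toMap).getOrElse(Map.empty)
      def column(colId: Int): Set[Int]      = cols.get(colId).map(_.toSet).getOrElse(Set.empty)
    }

    // One matrix per interaction type, each mirrored into the search index.
    val purchaseIndicators = new CooccurrenceMatrix(
      (item, row) => println(s"reindex item $item with ${row.size} indicators"))
    purchaseIndicators.update(rowId = 42, colId = 7, weight = 3.2)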
> 
> On Apr 23, 2015, at 3:04 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> This seems to violate the random choice of interactions to cut, but now
>> that I think about it, does a random choice really matter?
>> 
> 
> It hasn't ever mattered such that I could see.  There is also some reason
> to claim that earliest is best if items are very focussed in time.  Of
> course, the opposite argument also applies.  That leaves us with empiricism
> where the results are not definitive.
> 
> So I don't think that it matters, but I can't say for certain that it doesn't.
> 
> 
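
The cut being discussed above, sketched both ways (Scala; the per-user cap of 500 is a placeholder, not a recommendation):

    import scala.util.Random

    // Placeholder interaction record; only the timestamp matters for the "earliest" cut.
    case class Interaction(userId: Int, itemId: Int, timestamp: Long)

    val maxPerUser = 500   // assumed frequency cap per user, not a recommended value

    // Strategy 1: keep the earliest interactions.
    def cutEarliest(xs: Seq[Interaction]): Seq[Interaction] =
      xs.sortBy(_.timestamp).take(maxPerUser)

    // Strategy 2: keep a random subset; per the discussion above, the choice has not
    // seemed to matter much in practice.
    def cutRandom(xs: Seq[Interaction], rng: Random = new Random(42)): Seq[Interaction] =
      rng.shuffle(xs).take(maxPerUser)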


Re: Streaming and incremental cooccurrence

Posted by Sebastian <ss...@apache.org>.
Co-occurrence matrices should be fairly easy to partition over many
machines, so you would not be constrained by the memory available on a
single machine.
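
A very small sketch of that partitioning (plain hash partitioning by row key; the partition count is arbitrary):

    // Route each cooccurrence row to one of N machines by its integer item ID,
    // so no single machine needs to hold the whole matrix in memory.
    val numPartitions = 16   // placeholder cluster size

    def partitionFor(itemId: Int): Int =
      Math.floorMod(itemId, numPartitions)   // floorMod keeps negative IDs in range

    // Both row updates and row lookups for an item route to machine partitionFor(itemId).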
