Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2015/04/17 18:51:16 UTC

Streaming and incremental cooccurrence

I’ve been thinking about Streaming (continuous input) and incremental cooccurrence.

As interactions stream in from the user it is fairly simple to use something like Spark streaming to maintain a moving time window for all input, and an update frequency that recalcs all input currently in the time window. I’ve done this with the current cooccurrence code but though streaming, this is not incremental.

The current data flow goes from interaction input to geometry and user dictionary reconciliation to A’A, A’B etc. After the multiply the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled. 

Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
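
For illustration, a minimal sketch of that moving-window, non-incremental approach in Spark Streaming (Scala). The computeCooccurrences stub stands in for the existing batch cooccurrence code; the socket source, window length, and update frequency are placeholder assumptions.

  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

  object WindowedCooccurrence {
    // Stand-in for the existing batch calc: A'A, A'B, LLR weighting, down-sampling.
    def computeCooccurrences(interactions: RDD[(String, String)]): Unit = ()

    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("windowed-cooccurrence").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(60))

      // Assume interactions arrive as tab-separated "userId<TAB>itemId" lines.
      val interactions = ssc.socketTextStream("localhost", 9999)
        .map(_.split("\t"))
        .map(fields => (fields(0), fields(1)))

      // Moving time window over all input; every update period, recalc everything in the window.
      interactions
        .window(Minutes(24 * 60), Minutes(60)) // window length, update frequency
        .foreachRDD(rdd => computeCooccurrences(rdd))

      ssc.start()
      ssc.awaitTermination()
    }
  }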

Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Apr 18, 2015 at 11:29 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? This is because the frequency is known before the multiply
> and LLR. I assume the #2 cut is left in place?
>

Yes.  but I didn't think it was new.

Re: Streaming and incremental cooccurrence

Posted by Sebastian <ss...@apache.org>.
Co-occurrence matrices should be fairly easy to partition over many 
machines, so you would not be constrained by the memory available on a 
single machine.

On 06.05.2015 18:29, Pat Ferrel wrote:
> 100GB of RAM is practically common. Recently I’ve seen many indicators and item metadata stored with cooccurrence and indexed. This produces extremely flexible results since the query determines the result, not the model. But it does increase the number of cooccurrences linearly with # of indicator types.
>
> As to DB, any suggestions? It would need to have a very high performance memory cached implementation. I wonder if the search engine itself would work. This would at least reduce the number of subsystems to deal with.
>
> On Apr 24, 2015, at 4:13 PM, Ted Dunning <te...@gmail.com> wrote:
>
> Sounds about right.
>
> My guess is that memory is now large enough, especially on a cluster that
> the cooccurrence will fit into memory quite often.  Taking a large example
> of 10 million items and 10,000 cooccurrences each, there will be 100
> billion cooccurrences to store which shouldn't take more than about half a
> TB of data if fully populated.  This isn't that outrageous any more.  With
> SSD's as backing store, even 100GB of RAM or less might well produce very
> nice results.  Depending on incoming transaction rates, using spinning disk
> as a backing store might also work with small memory.
>
> Experiments are in order.
>
>
>
> On Fri, Apr 24, 2015 at 8:12 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>> Ok, seems right.
>>
>> So now to data structures. The input frequency vectors need to be paired
>> with each input interaction type and would be nice to have as something
>> that can be copied very fast as they get updated. Random access would also
>> be nice but iteration is not needed. Over time they will get larger as all
>> items get interactions, users will get more actions and appear in more
>> vectors (with multi-interaction data). Seems like hashmaps?
>>
>> The cooccurrence matrix is more of a question to me. It needs to be
>> updatable at the row and column level, and random access for both row and
>> column would be nice. It needs to be expandable. To keep it small the keys
>> should be integers, not full blown ID strings. There will have to be one
>> matrix per interaction type. It should be simple to update the Search
>> Engine to either mirror the matrix or use it directly for index updates.
>> Each indicator update should cause an index update.
>>
>> Putting aside speed and size issues this sounds like a NoSQL DB table that
>> is cached in-memory.
>>
>> On Apr 23, 2015, at 3:04 PM, Ted Dunning <te...@gmail.com> wrote:
>>
>> On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>>> This seems to violate the random choice of interactions to cut but now
>>> that I think about it does a random choice really matter?
>>>
>>
>> It hasn't ever mattered such that I could see.  There is also some reason
>> to claim that earliest is best if items are very focussed in time.  Of
>> course, the opposite argument also applies.  That leaves us with empiricism
>> where the results are not definitive.
>>
>> So I don't think that it matters, but I can't say for certain that it doesn't.
>>
>>
>

Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
100GB of RAM is practically common. Recently I’ve seen many indicators and item metadata stored with cooccurrence and indexed. This produces extremely flexible results since the query determines the result, not the model. But it does increase the number of cooccurrences linearly with # of indicator types.

As to DB, any suggestions? It would need to have a very high performance memory cached implementation. I wonder if the search engine itself would work. This would at least reduce the number of subsystems to deal with.

On Apr 24, 2015, at 4:13 PM, Ted Dunning <te...@gmail.com> wrote:

Sounds about right.

My guess is that memory is now large enough, especially on a cluster that
the cooccurrence will fit into memory quite often.  Taking a large example
of 10 million items and 10,000 cooccurrences each, there will be 100
billion cooccurrences to store which shouldn't take more than about half a
TB of data if fully populated.  This isn't that outrageous any more.  With
SSD's as backing store, even 100GB of RAM or less might well produce very
nice results.  Depending on incoming transaction rates, using spinning disk
as a backing store might also work with small memory.

Experiments are in order.



On Fri, Apr 24, 2015 at 8:12 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Ok, seems right.
> 
> So now to data structures. The input frequency vectors need to be paired
> with each input interaction type and would be nice to have as something
> that can be copied very fast as they get updated. Random access would also
> be nice but iteration is not needed. Over time they will get larger as all
> items get interactions, users will get more actions and appear in more
> vectors (with multi-interaction data). Seems like hashmaps?
> 
> The cooccurrence matrix is more of a question to me. It needs to be
> updatable at the row and column level, and random access for both row and
> column would be nice. It needs to be expandable. To keep it small the keys
> should be integers, not full blown ID strings. There will have to be one
> matrix per interaction type. It should be simple to update the Search
> Engine to either mirror the matrix or use it directly for index updates.
> Each indicator update should cause an index update.
> 
> Putting aside speed and size issues this sounds like a NoSQL DB table that
> is cached in-memory.
> 
> On Apr 23, 2015, at 3:04 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> This seems to violate the random choice of interactions to cut but now
>> that I think about it does a random choice really matter?
>> 
> 
> It hasn't ever mattered such that I could see.  There is also some reason
> to claim that earliest is best if items are very focussed in time.  Of
> course, the opposite argument also applies.  That leaves us with empiricism
> where the results are not definitive.
> 
> So I don't think that it matters, but I can't say for certain that it doesn't.
> 
> 


Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
Sounds about right.

My guess is that memory is now large enough, especially on a cluster that
the cooccurrence will fit into memory quite often.  Taking a large example
of 10 million items and 10,000 cooccurrences each, there will be 100
billion cooccurrences to store which shouldn't take more than about half a
TB of data if fully populated.  This isn't that outrageous any more.  With
SSD's as backing store, even 100GB of RAM or less might well produce very
nice results.  Depending on incoming transaction rates, using spinning disk
as a backing store might also work with small memory.
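
Spelling out the arithmetic behind that estimate (the bytes-per-entry figure is simply what the half-TB number implies, e.g. a packed int item id plus a small weight, amortized):

  10^7 items x 10^4 cooccurrences/item = 10^11 stored entries
  10^11 entries x ~5 bytes/entry ~= 5 x 10^11 bytes ~= 0.5 TB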

Experiments are in order.



On Fri, Apr 24, 2015 at 8:12 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Ok, seems right.
>
> So now to data structures. The input frequency vectors need to be paired
> with each input interaction type and would be nice to have as something
> that can be copied very fast as they get updated. Random access would also
> be nice but iteration is not needed. Over time they will get larger as all
> items get interactions, users will get more actions and appear in more
> vectors (with multi-interaction data). Seems like hashmaps?
>
> The cooccurrence matrix is more of a question to me. It needs to be
> updatable at the row and column level, and random access for both row and
> column would be nice. It needs to be expandable. To keep it small the keys
> should be integers, not full blown ID strings. There will have to be one
> matrix per interaction type. It should be simple to update the Search
> Engine to either mirror the matrix or use it directly for index updates.
> Each indicator update should cause an index update.
>
> Putting aside speed and size issues this sounds like a NoSQL DB table that
> is cached in-memory.
>
> On Apr 23, 2015, at 3:04 PM, Ted Dunning <te...@gmail.com> wrote:
>
> On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > This seems to violate the random choice of interactions to cut but now
> > that I think about it does a random choice really matter?
> >
>
> It hasn't ever mattered such that I could see.  There is also some reason
> to claim that earliest is best if items are very focussed in time.  Of
> course, the opposite argument also applies.  That leaves us with empiricism
> where the results are not definitive.
>
> So I don't think that it matters, but I can't say for certain that it doesn't.
>
>

Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ok, seems right.

So now to data structures. The input frequency vectors need to be paired with each input interaction type and would be nice to have as something that can be copied very fast as they get updated. Random access would also be nice but iteration is not needed. Over time they will get larger as all items get interactions, users will get more actions and appear in more vectors (with multi-interaction data). Seems like hashmaps?

The cooccurrence matrix is more of a question to me. It needs to be updatable at the row and column level, and random access for both row and column would be nice. It needs to be expandable. To keep it small the keys should be integers, not full blown ID strings. There will have to be one matrix per interaction type. It should be simple to update the Search Engine to either mirror the matrix or use it directly for index updates. Each indicator update should cause an index update.

Putting aside speed and size issues this sounds like a NoSQL DB table that is cached in-memory. 
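
As a strawman, those structures might look roughly like the following in plain Scala collections (names are hypothetical; Mahout sparse vectors could stand in for the inner maps):

  import scala.collection.mutable

  // One instance per interaction type (view, purchase, ...).
  class IncrementalIndicatorState {
    // Item dictionary: external string IDs mapped to compact integer keys.
    val itemDictionary = mutable.HashMap.empty[String, Int]

    // Interaction frequency counts: random access, cheap to update.
    val userFrequency = mutable.HashMap.empty[Int, Long] // userKey -> interaction count
    val itemFrequency = mutable.HashMap.empty[Int, Long] // itemKey -> interaction count

    // Cooccurrence "matrix" as a map of rows: updatable per row and per column,
    // expandable, integer keys only.
    val cooccurrence = mutable.HashMap.empty[Int, mutable.HashMap[Int, Double]]

    def itemKey(id: String): Int =
      itemDictionary.getOrElseUpdate(id, itemDictionary.size)

    def addCooccurrence(rowItem: Int, colItem: Int, weight: Double): Unit = {
      val row = cooccurrence.getOrElseUpdate(rowItem, mutable.HashMap.empty[Int, Double])
      row(colItem) = row.getOrElse(colItem, 0.0) + weight
    }
  }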

On Apr 23, 2015, at 3:04 PM, Ted Dunning <te...@gmail.com> wrote:

On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> This seems to violate the random choice of interactions to cut but now
> that I think about it does a random choice really matter?
> 

It hasn't ever mattered such that I could see.  There is also some reason
to claim that earliest is best if items are very focussed in time.  Of
course, the opposite argument also applies.  That leaves us with empiricism
where the results are not definitive.

So I don't think that it matters, but I can't say for certain that it doesn't.


Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> This seems to violate the random choice of interactions to cut but now
> that I think about it does a random choice really matter?
>

It hasn't ever mattered such that I could see.  There is also some reason
to claim that earliest is best if items are very focussed in time.  Of
course, the opposite argument also applies.  That leaves us with empiricism
where the results are not definitive.

So I don't think that it matters, but I can't say for certain that it doesn't.

Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Randomizing interaction down-sampling is probably more important on the starting batch since it is done on entire input row or column, not so important when a cut-off is already reached. All new interactions (new items for instance) would not have reached the cut anyway, which is important since one of the big reasons for incremental is to quickly account for new items.

So I guess I agree that there is very little practical difference between incremental streaming and moving window streaming. There is a big difference in implementation and computation time, of course.
 
On Apr 23, 2015, at 5:53 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Removal is not as important as adding (which can be done). Also removal is often for business logic, like removal from a catalog, so a refresh may be driven by non-math considerations. Removal of users is only to clean up things, not required very often. Removal of items can happen from recs too, mitigating the issue.

The way the downsampling works now is to randomly remove interactions if we know there will be too many so that we end up with the right amount. The incremental approach would filter out all new interactions that are over the limit since the old interactions are not kept. This seems to violate the random choice of interactions to cut but now that I think about it does a random choice really matter?

On Apr 22, 2015, at 10:01 PM, Ted Dunning <te...@gmail.com> wrote:

On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.


Actually, the method I was pushing is exact.  If the sampling is made
deterministic using clever seeds, then deletion is even possible since you
can determine whether an observation was thrown away rather than used to
increment counts.

The only creeping crud aspect of this is the accumulation of zero rows as
things fall out of the accumulation window.  I would be tempted to not
allow deletion and just restart as Pat is suggesting.



Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Removal is not as important as adding (which can be done). Also removal is often for business logic, like removal from a catalog, so a refresh may be driven by non-math considerations. Removal of users is only to clean up things, not required very often. Removal of items can happen from recs too, mitigating the issue.

The way the downsampling works now is to randomly remove interactions if we know there will be too many so that we end up with the right amount. The incremental approach would filter out all new interactions that are over the limit since the old interactions are not kept. This seems to violate the random choice of interactions to cut but now that I think about it does a random choice really matter?
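
A toy sketch of that incremental cut (hypothetical names; maxPrefs plays the role of the existing down-sampling limit): once a user or item has reached the cap, later interactions are simply dropped, i.e. "keep oldest" rather than a random choice.

  import scala.collection.mutable

  class FrequencyCapFilter(maxPrefs: Int) {
    private val userCounts = mutable.HashMap.empty[String, Int]
    private val itemCounts = mutable.HashMap.empty[String, Int]

    // True if the interaction should be counted, false if it is filtered away.
    def accept(user: String, item: String): Boolean = {
      val u = userCounts.getOrElse(user, 0)
      val i = itemCounts.getOrElse(item, 0)
      if (u >= maxPrefs || i >= maxPrefs) {
        false
      } else {
        userCounts(user) = u + 1
        itemCounts(item) = i + 1
        true
      }
    }
  }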

On Apr 22, 2015, at 10:01 PM, Ted Dunning <te...@gmail.com> wrote:

On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.


Actually, the method I was pushing is exact.  If the sampling is made
deterministic using clever seeds, then deletion is even possible since you
can determine whether an observation was thrown away rather than used to
increment counts.

The only creeping crud aspect of this is the accumulation of zero rows as
things fall out of the accumulation window.  I would be tempted to not
allow deletion and just restart as Pat is suggesting.


Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.


Actually, the method I was pushing is exact.  If the sampling is made
deterministic using clever seeds, then deletion is even possible since you
can determine whether an observation was thrown away rather than used to
increment counts.

The only creeping crud aspect of this is the accumulation of zero rows as
things fall out of the accumulation window.  I would be tempted to not
allow deletion and just restart as Pat is suggesting.
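
One way to read "deterministic using clever seeds" (an illustration only, not necessarily the scheme Ted means): make the keep/drop decision a pure function of the observation and a fixed seed, so a later deletion can replay the decision and decrement counts only if the observation was actually counted. The sketch uses a fixed sampling rate rather than the per-row/column cap of the real down-sampling.

  import scala.util.hashing.MurmurHash3

  object DeterministicSampler {
    // The decision depends only on (user, item, seed), so it can be recomputed at delete time.
    def keep(user: String, item: String, rate: Double, seed: Int = 42): Boolean = {
      val h = MurmurHash3.stringHash(user + "\u0000" + item, seed)
      val u = (h & 0x7fffffff).toDouble / Int.MaxValue // roughly uniform in [0, 1]
      u < rate
    }
  }

  // On insert: if (DeterministicSampler.keep(u, i, rate)) increment the counts.
  // On delete: if (DeterministicSampler.keep(u, i, rate)) decrement the counts.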

Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Currently maxPrefs is applied to input, both row and column (in hadoop and scala) and has a default of 500. maxSimilaritiesPerItem is for the cooccurrence matrix and is applied to rows. The default is 50. Similar down-sampling is done on row similarity.

For a new way to use threshold I was thinking of one that is relative to the data itself and would always produce the same number of items in the input but based only on a quality threshold, not row and column counts. From Sebastian’s paper this may not produce much benefit and the downside is that the input distribution parameters must be calculated before sparsification. This is avoided with a fixed threshold and/or row and column count downsampling.
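
As a sketch of that not-yet-implemented "number of sigmas" idea (illustrative only): compute the mean and standard deviation of the LLR scores being thresholded and keep only scores above mean + k sigmas, which is exactly why the distribution parameters have to be known before sparsification.

  // Keep only scores above mean + `sigmas` standard deviations.
  def sigmaThreshold(scores: Seq[Double], sigmas: Double): Seq[Double] = {
    if (scores.isEmpty) return scores
    val n = scores.size
    val mean = scores.sum / n
    val variance = scores.map(s => (s - mean) * (s - mean)).sum / n
    val cutoff = mean + sigmas * math.sqrt(variance)
    scores.filter(_ >= cutoff)
  }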

BTW there is another half-way method to do part of this by juggling DStreams and RDDs. Trade-offs apply of course. 

The idea would be to make Cooccurrence a streaming operation fed by an Update Period of micro-batches. Keeping the input as a DStream allows us to drop old data when new nano-batches come in but the entire time window is potentially large, maybe months for long lived items. The time window would be fed to Cooccurrence periodically.

The benefit is that the process never reads persisted data (a fairly time consuming operation with nano-batches) but is passed new RDDs that have come from some streaming input (Kafka?)
The downside is that it still needs the entire time window’s worth of data for the calc. In Spark terms the input is taken from a DStream.

I think we have been talking about an idea that does an incremental approximation, then a refresh every so often to remove any approximation so in an ideal world we need both.

Streaming but non incremental would be relatively easy and use current math code. Incremental would require in-memory data structures of custom design. 


On Apr 19, 2015, at 8:39 PM, Ted Dunning <te...@gmail.com> wrote:

Inline

On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Short answer, you are correct this is not a new filter.
> 
> The Hadoop MapReduce implements:
> * maxSimilaritiesPerItem
> * maxPrefs
> * minPrefsPerUser
> * threshold
> 
> Scala version:
> * maxSimilaritiesPerItem
> 

I think of this as "column-wise", but that may be bad terminology.


> * maxPrefs
> 

And I think of this as "row-wise" or "user limit".  I think it is the
interaction-cut from the paper.


> 
> The paper talks about an interaction-cut, and describes it with "There is
> no significant decrease in the error for incorporating more interactions
> from the ‘power users’ after that.” While I’d trust your reading better
> than mine I thought that meant downsampling overactive users.
> 

I agree.



> 
> However both the Hadoop Mapreduce and the Scala version downsample both
> user and item interactions by maxPrefs. So you are correct, not a new thing.
> 
> The paper also talks about the threshold and we’ve talked on the list
> about how better to implement that. A fixed number is not very useful so a
> number of sigmas was proposed but is not yet implemented.
> 

I think that both  minPrefsPerUser and threshold have limited utility in
the current code.  Could be wrong about that.

With low quality association measures that suffer from low count problems
or simplistic user-based methods, minPrefsPerUser can be crucial.
Threshold can also be required for systems like that.

The Scala code doesn't have that problem since it doesn't support those
metrics.


Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
Inline

On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Short answer, you are correct this is not a new filter.
>
> The Hadoop MapReduce implements:
> * maxSimilaritiesPerItem
> * maxPrefs
> * minPrefsPerUser
> * threshold
>
> Scala version:
> * maxSimilaritiesPerItem
>

I think of this as "column-wise", but that may be bad terminology.


> * maxPrefs
>

And I think of this as "row-wise" or "user limit".  I think it is the
interaction-cut from the paper.


>
> The paper talks about an interaction-cut, and describes it with "There is
> no significant decrease in the error for incorporating more interactions
> from the ‘power users’ after that.” While I’d trust your reading better
> than mine I thought that meant downsampling overactive users.
>

I agree.



>
> However both the Hadoop Mapreduce and the Scala version downsample both
> user and item interactions by maxPrefs. So you are correct, not a new thing.
>
> The paper also talks about the threshold and we’ve talked on the list
> about how better to implement that. A fixed number is not very useful so a
> number of sigmas was proposed but is not yet implemented.
>

I think that both  minPrefsPerUser and threshold have limited utility in
the current code.  Could be wrong about that.

With low quality association measures that suffer from low count problems
or simplistic user-based methods, minPrefsPerUser can be crucial.
Threshold can also be required for systems like that.

The Scala code doesn't have that problem since it doesn't support those
metrics.

Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Short answer, you are correct this is not a new filter.

The Hadoop MapReduce implements:
* maxSimilaritiesPerItem
* maxPrefs
* minPrefsPerUser
* threshold

Scala version:
* maxSimilaritiesPerItem
* maxPrefs

The paper talks about an interaction-cut, and describes it with "There is no significant decrease in the error for incorporating more interactions from the ‘power users’ after that.” While I’d trust your reading better than mine I thought that meant downsampling overactive users. 

However both the Hadoop Mapreduce and the Scala version downsample both user and item interactions by maxPrefs. So you are correct, not a new thing.

The paper also talks about the threshold and we’ve talked on the list about how better to implement that. A fixed number is not very useful so a number of sigmas was proposed but is not yet implemented. 


On Apr 18, 2015, at 4:39 PM, Ted Dunning <te...@gmail.com> wrote:

I just answered in another thread.

Yes, I am.  I just didn't think I was proposing it.  Thought it was in
Sebastian's paper and ultimately in our code (that I haven't looked at in
over a year).


On Sat, Apr 18, 2015 at 7:38 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Hey Ted,
> 
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? It is performed before the multiply, right?
> 
> 
> On Apr 18, 2015, at 8:29 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> There are two cuts applied in the batch calc:
> 1) number of interactions per user
> 2) number of items in the resulting cooccurrence vectors (calc LLR, sort,
> lowest items cut per limit)
> 
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? This is because the frequency is known before the multiply
> and LLR. I assume the #2 cut is left in place?
> 
> If so you would need something like 4 things with random access/in memory
> speed
> 1&2) frequency vectors for interactions per user and item, these may be
> updated and are used to calculate LLR and also for cutting new update
> interactions.
> 3) cooccurrence matrix with LLR weights—this is also stored in search
> engine or DB (without weights) so any update needs to trigger an engine
> index update.
> 4) item dictionary
> 
> #3 might not be a matrix but a hashmap key = item-id, value = vector of
> items. If the vector has item keys = int you would also need the item
> dictionary.
> 
> Given the frequency vectors it seems like the interaction matrices are no
> longer needed?
> 
> 
> On Apr 17, 2015, at 7:25 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> 
> Yes. Also add the fact that the nano batches are bounded tightly in size
> both max and mean. And mostly filtered away anyway.
> 
> Aging is an open question. I have never seen any effect of alternative
> sampling so I would just assume "keep oldest" which just tosses more
> samples. Then occasionally rebuild from batch if you really want aging to
> go right.
> 
> Search updates any more are true realtime also so that works very well.
> 
> Sent from my iPhone
> 
>> On Apr 17, 2015, at 17:20, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>> Thanks.
>> 
>> This idea is based on a micro-batch of interactions per update, not
> individual ones unless I missed something. That matches the typical input
> flow. Most interactions are filtered away by  frequency and number of
> interaction cuts.
>> 
>> A couple practical issues
>> 
>> In practice won’t this require aging of interactions too? So wouldn’t
> the update require some old interaction removal? I suppose this might just
> take the form of added null interactions representing the geriatric ones?
> Haven’t gone through the math with enough detail to see if you’ve already
> accounted for this.
>> 
>> To use actual math (self-join, etc.) we still need to alter the geometry
> of the interactions to have the same row rank as the adjusted total. In
> other words the number of rows in all resulting interactions must be the
> same. Over time this means completely removing rows and columns or allowing
> empty rows in potentially all input matrices.
>> 
>> Might not be too bad to accumulate gaps in rows and columns. Not sure if
> it would have a practical impact (to some large limit) as long as it was
> done, to keep the real size more or less fixed.
>> 
>> As to realtime, that would be under search engine control through
> incremental indexing and there are a couple ways to do that, not a problem
> afaik. As you point out the query always works and is real time. The index
> update must be frequent and not impact the engine's availability for
> queries.
>> 
>> On Apr 17, 2015, at 2:46 PM, Ted Dunning <te...@gmail.com> wrote:
>> 
>> 
>> When I think of real-time adaptation of indicators, I think of this:
>> 
>> 
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
>> 
>> 
>>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>>> I’ve been thinking about Streaming (continuous input) and incremental
> cooccurrence.
>>> 
>>> As interactions stream in from the user it is fairly simple to use
> something like Spark streaming to maintain a moving time window for all
> input, and an update frequency that recalcs all input currently in the time
> window. I’ve done this with the current cooccurrence code but though
> streaming, this is not incremental.
>>> 
>>> The current data flow goes from interaction input to geometry and user
> dictionary reconciliation to A’A, A’B etc. After the multiply the resulting
> cooccurrence matrices are LLR weighted/filtered/down-sampled.
>>> 
>>> Incremental can mean all sorts of things and may imply different
> trade-offs. Did you have anything specific in mind?
>> 
>> 
> 
> 
> 


Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
I just answered in another thread.

Yes, I am.  I just didn't think I was proposing it.  Thought it was in
Sebastian's paper and ultimately in our code (that I haven't looked at in
over a year).


On Sat, Apr 18, 2015 at 7:38 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Hey Ted,
>
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? It is performed before the multiply, right?
>
>
> On Apr 18, 2015, at 8:29 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> There are two cuts applied in the batch calc:
> 1) number of interactions per user
> 2) number of items in the resulting cooccurrence vectors (calc LLR, sort,
> lowest items cut per limit)
>
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? This is because the frequency is known before the multiply
> and LLR. I assume the #2 cut is left in place?
>
> If so you would need something like 4 things with random access/in memory
> speed
> 1&2) frequency vectors for interactions per user and item, these may be
> updated and are used to calculate LLR and also for cutting new update
> interactions.
> 3) cooccurrence matrix with LLR weights—this is also stored in search
> engine or DB (without weights) so any update needs to trigger an engine
> index update.
> 4) item dictionary
>
> #3 might not be a matrix but a hashmap key = item-id, value = vector of
> items. If the vector has item keys = int you would also need the item
> dictionary.
>
> Given the frequency vectors it seems like the interaction matrices are no
> longer needed?
>
>
> On Apr 17, 2015, at 7:25 PM, Ted Dunning <te...@gmail.com> wrote:
>
>
> Yes. Also add the fact that the nano batches are bounded tightly in size
> both max and mean. And mostly filtered away anyway.
>
> Aging is an open question. I have never seen any effect of alternative
> sampling so I would just assume "keep oldest" which just tosses more
> samples. Then occasionally rebuild from batch if you really want aging to
> go right.
>
> Search updates any more are true realtime also so that works very well.
>
> Sent from my iPhone
>
> > On Apr 17, 2015, at 17:20, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >
> > Thanks.
> >
> > This idea is based on a micro-batch of interactions per update, not
> individual ones unless I missed something. That matches the typical input
> flow. Most interactions are filtered away by  frequency and number of
> interaction cuts.
> >
> > A couple practical issues
> >
> > In practice won’t this require aging of interactions too? So wouldn’t
> the update require some old interaction removal? I suppose this might just
> take the form of added null interactions representing the geriatric ones?
> Haven’t gone through the math with enough detail to see if you’ve already
> accounted for this.
> >
> > To use actual math (self-join, etc.) we still need to alter the geometry
> of the interactions to have the same row rank as the adjusted total. In
> other words the number of rows in all resulting interactions must be the
> same. Over time this means completely removing rows and columns or allowing
> empty rows in potentially all input matrices.
> >
> > Might not be too bad to accumulate gaps in rows and columns. Not sure if
> it would have a practical impact (to some large limit) as long as it was
> done, to keep the real size more or less fixed.
> >
> > As to realtime, that would be under search engine control through
> incremental indexing and there are a couple ways to do that, not a problem
> afaik. As you point out the query always works and is real time. The index
> update must be frequent and not impact the engine's availability for
> queries.
> >
> > On Apr 17, 2015, at 2:46 PM, Ted Dunning <te...@gmail.com> wrote:
> >
> >
> > When I think of real-time adaptation of indicators, I think of this:
> >
> >
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> >
> >
> >> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> >> I’ve been thinking about Streaming (continuous input) and incremental
> cooccurrence.
> >>
> >> As interactions stream in from the user it is fairly simple to use
> something like Spark streaming to maintain a moving time window for all
> input, and an update frequency that recalcs all input currently in the time
> window. I’ve done this with the current cooccurrence code but though
> streaming, this is not incremental.
> >>
> >> The current data flow goes from interaction input to geometry and user
> dictionary reconciliation to A’A, A’B etc. After the multiply the resulting
> cooccurrence matrices are LLR weighted/filtered/down-sampled.
> >>
> >> Incremental can mean all sorts of things and may imply different
> trade-offs. Did you have anything specific in mind?
> >
> >
>
>
>

Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Hey Ted,

You seem to be proposing a new cut by frequency of item interaction, is this correct? It is performed before the multiply, right?


On Apr 18, 2015, at 8:29 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

There are two cuts applied in the batch calc:
1) number of interactions per user
2) number of items in the resulting cooccurrence vectors (calc LLR, sort, lowest items cut per limit)

You seem to be proposing a new cut by frequency of item interaction, is this correct? This is because the frequency is known before the multiply and LLR. I assume the #2 cut is left in place?

If so you would need something like 4 things with random access/in memory speed
1&2) frequency vectors for interactions per user and item, these may be updated and are used to calculate LLR and also for cutting new update interactions.
3) cooccurrence matrix with LLR weights—this is also stored in search engine or DB (without weights) so any update needs to trigger an engine index update.
4) item dictionary

#3 might not be a matrix but a hashmap key = item-id, value = vector of items. If the vector has item keys = int you would also need the item dictionary.

Given the frequency vectors it seems like the interaction matrices are no longer needed?


On Apr 17, 2015, at 7:25 PM, Ted Dunning <te...@gmail.com> wrote:


Yes. Also add the fact that the nano batches are bounded tightly in size both max and mean. And mostly filtered away anyway. 

Aging is an open question. I have never seen any effect of alternative sampling so I would just assume "keep oldest" which just tosses more samples. Then occasionally rebuild from batch if you really want aging to go right.  

Search updates any more are true realtime also so that works very well. 

Sent from my iPhone

> On Apr 17, 2015, at 17:20, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> Thanks. 
> 
> This idea is based on a micro-batch of interactions per update, not individual ones unless I missed something. That matches the typical input flow. Most interactions are filtered away by  frequency and number of interaction cuts.
> 
> A couple practical issues
> 
> In practice won’t this require aging of interactions too? So wouldn’t the update require some old interaction removal? I suppose this might just take the form of added null interactions representing the geriatric ones? Haven’t gone through the math with enough detail to see if you’ve already accounted for this.
> 
> To use actual math (self-join, etc.) we still need to alter the geometry of the interactions to have the same row rank as the adjusted total. In other words the number of rows in all resulting interactions must be the same. Over time this means completely removing rows and columns or allowing empty rows in potentially all input matrices.
> 
> Might not be too bad to accumulate gaps in rows and columns. Not sure if it would have a practical impact (to some large limit) as long as it was done, to keep the real size more or less fixed.
> 
> As to realtime, that would be under search engine control through incremental indexing and there are a couple ways to do that, not a problem afaik. As you point out the query always works and is real time. The index update must be frequent and not impact the engine's availability for queries.
> 
> On Apr 17, 2015, at 2:46 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> 
> When I think of real-time adaptation of indicators, I think of this:
> 
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> 
> 
>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> I’ve been thinking about Streaming (continuous input) and incremental cooccurrence.
>> 
>> As interactions stream in from the user it is fairly simple to use something like Spark streaming to maintain a moving time window for all input, and an update frequency that recalcs all input currently in the time window. I’ve done this with the current cooccurrence code but though streaming, this is not incremental.
>> 
>> The current data flow goes from interaction input to geometry and user dictionary reconciliation to A’A, A’B etc. After the multiply the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled.
>> 
>> Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
> 
> 



Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Apr 18, 2015 at 11:29 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> If so you would need something like 4 things with random access/in memory
> speed
> 1&2) frequency vectors for interactions per user and item, these may be
> updated and are used to calculate LLR and also for cutting new update
> interactions.
>

Yes.


> 3) cooccurrence matrix with LLR weights—this is also stored in search
> engine or DB (without weights) so any update needs to trigger an engine
> index update.
>

Yes.  But I actually think that this might work well as a cached non-memory
table.  TBD.


> 4) item dictionary
>

Yes.


>
> #3 might not be a matrix but a hashmap key = item-id, value = vector of
> items. If the vector has item keys = int you would also need the item
> dictionary.
>

Yes.


>
> Given the frequency vectors it seems like the interaction matrices are no
> longer needed?
>

Not needed, but I would definitely keep them around in the form of logs.

Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
There are two cuts applied in the batch calc:
1) number of interactions per user
2) number of items in the resulting cooccurrence vectors (calc LLR, sort, lowest items cut per limit)

You seem to be proposing a new cut by frequency of item interaction, is this correct? This is because the frequency is known before the multiply and LLR. I assume the #2 cut is left in place?

If so you would need something like 4 things with random access/in memory speed
1&2) frequency vectors for interactions per user and item, these may be updated and are used to calculate LLR and also for cutting new update interactions.
3) cooccurrence matrix with LLR weights—this is also stored in search engine or DB (without weights) so any update needs to trigger an engine index update.
4) item dictionary

#3 might not be a matrix but a hashmap key = item-id, value = vector of items. If the vector has item keys = int you would also need the item dictionary.

Given the frequency vectors it seems like the interaction matrices are no longer needed?
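
For reference, the LLR weighting that those frequency vectors feed reduces to a 2x2 contingency table per item pair; a Scala transcription of the standard computation (following Mahout's LogLikelihood helper):

  def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

  def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum

  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy)) // guard against round-off
  }

  // Given the per-item frequencies and a cooccurrence count, the table is:
  //   k11 = cooc(A, B)
  //   k12 = freq(A) - k11
  //   k21 = freq(B) - k11
  //   k22 = totalInteractions - k11 - k12 - k21
  // which is why the raw interaction matrices are not strictly needed for the weights.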


On Apr 17, 2015, at 7:25 PM, Ted Dunning <te...@gmail.com> wrote:


Yes. Also add the fact that the nano batches are bounded tightly in size both max and mean. And mostly filtered away anyway. 

Aging is an open question. I have never seen any effect of alternative sampling so I would just assume "keep oldest" which just tosses more samples. Then occasionally rebuild from batch if you really want aging to go right.  

Search updates any more are true realtime also so that works very well. 

Sent from my iPhone

> On Apr 17, 2015, at 17:20, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> Thanks. 
> 
> This idea is based on a micro-batch of interactions per update, not individual ones unless I missed something. That matches the typical input flow. Most interactions are filtered away by  frequency and number of interaction cuts.
> 
> A couple practical issues
> 
> In practice won’t this require aging of interactions too? So wouldn’t the update require some old interaction removal? I suppose this might just take the form of added null interactions representing the geriatric ones? Haven’t gone through the math with enough detail to see if you’ve already accounted for this.
> 
> To use actual math (self-join, etc.) we still need to alter the geometry of the interactions to have the same row rank as the adjusted total. In other words the number of rows in all resulting interactions must be the same. Over time this means completely removing rows and columns or allowing empty rows in potentially all input matrices.
> 
> Might not be too bad to accumulate gaps in rows and columns. Not sure if it would have a practical impact (to some large limit) as long as it was done, to keep the real size more or less fixed.
> 
> As to realtime, that would be under search engine control through incremental indexing and there are a couple ways to do that, not a problem afaik. As you point out the query always works and is real time. The index update must be frequent and not impact the engine's availability for queries.
> 
> On Apr 17, 2015, at 2:46 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> 
> When I think of real-time adaptation of indicators, I think of this:
> 
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> 
> 
>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> I’ve been thinking about Streaming (continuous input) and incremental cooccurrence.
>> 
>> As interactions stream in from the user it is fairly simple to use something like Spark streaming to maintain a moving time window for all input, and an update frequency that recalcs all input currently in the time window. I’ve done this with the current cooccurrence code but though streaming, this is not incremental.
>> 
>> The current data flow goes from interaction input to geometry and user dictionary reconciliation to A’A, A’B etc. After the multiply the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled.
>> 
>> Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
> 
> 


Re: Streaming and incremental cooccurrence

Posted by Andrew Musselman <an...@gmail.com>.
Cool

On Saturday, April 18, 2015, Ted Dunning <te...@gmail.com> wrote:

>
> Andrew
>
> Take a look at the slides I posted.  In them I showed that the update does
> not grow beyond a very reasonable bound.
>
> Sent from my iPhone
>
> > On Apr 18, 2015, at 9:15, Andrew Musselman <andrew.musselman@gmail.com> wrote:
> >
> > Yes that's what I mean; if the number of updates gets too big it probably
> > would be unmanageable though.  This approach worked well with daily
> > updates, but never tried it with anything "real time."
> >
> >> On Saturday, April 18, 2015, Pat Ferrel <pat@occamsmachete.com> wrote:
> >>
> >> I think you are saying that instead of val newHashMap = lastHashMap ++
> >> updateHashMap, layered updates might be useful since new and last are
> >> potentially large. Some limit of updates might trigger a refresh. This
> >> might work if the update works with incremental index updates in the
> search
> >> engine. Given practical considerations the updates will be numerous and
> >> nearly empty.
> >>
> >> On Apr 17, 2015, at 7:58 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
> >>
> >> I have not implemented it for recommendations but a layered cache/sieve
> >> structure could be useful.
> >>
> >> That is, between batch refreshes you can keep tacking on new updates in
> a
> >> cascading order so values that are updated exist in the newest layer but
> >> otherwise the lookup goes for the latest updated layer.
> >>
> >> You can put a fractional multiplier on older layers for aging but again
> >> I've not implemented it.
> >>
> >> On Friday, April 17, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>
> >>>
> >>> Yes. Also add the fact that the nano batches are bounded tightly in
> size
> >>> both max and mean. And mostly filtered away anyway.
> >>>
> >>> Aging is an open question. I have never seen any effect of alternative
> >>> sampling so I would just assume "keep oldest" which just tosses more
> >>> samples. Then occasionally rebuild from batch if you really want aging
> to
> >>> go right.
> >>>
> >>> Search updates any more are true realtime also so that works very well.
> >>>
> >>> Sent from my iPhone
> >>>
> >>>> On Apr 17, 2015, at 17:20, Pat Ferrel <pat@occamsmachete.com> wrote:
> >>>>
> >>>> Thanks.
> >>>>
> >>>> This idea is based on a micro-batch of interactions per update, not
> >>> individual ones unless I missed something. That matches the typical
> input
> >>> flow. Most interactions are filtered away by  frequency and number of
> >>> interaction cuts.
> >>>>
> >>>> A couple practical issues
> >>>>
> >>>> In practice won’t this require aging of interactions too? So wouldn’t
> >>> the update require some old interaction removal? I suppose this might
> >> just
> >>> take the form of added null interactions representing the geriatric
> ones?
> >>> Haven’t gone through the math with enough detail to see if you’ve
> already
> >>> accounted for this.
> >>>>
> >>>> To use actual math (self-join, etc.) we still need to alter the
> geometry
> >>> of the interactions to have the same row rank as the adjusted total. In
> >>> other words the number of rows in all resulting interactions must be
> the
> >>> same. Over time this means completely removing rows and columns or
> >> allowing
> >>> empty rows in potentially all input matrices.
> >>>>
> >>>> Might not be too bad to accumulate gaps in rows and columns. Not sure
> if
> >>> it would have a practical impact (to some large limit) as long as it
> was
> >>> done, to keep the real size more or less fixed.
> >>>>
> >>>> As to realtime, that would be under search engine control through
> >>> incremental indexing and there are a couple ways to do that, not a
> >> problem
> >>> afaik. As you point out the query always works and is real time. The
> >> index
> >>> update must be frequent and not impact the engine's availability for
> >>> queries.
> >>>>
> >>>> On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>>
> >>>>
> >>>> When I think of real-time adaptation of indicators, I think of this:
> >>
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> >>>>
> >>>>
> >>>>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
> >>>>> I’ve been thinking about Streaming (continuous input) and incremental
> >>> cooccurrence.
> >>>>>
> >>>>> As interactions stream in from the user it is fairly simple to use
> >>> something like Spark streaming to maintain a moving time window for all
> >>> input, and an update frequency that recalcs all input currently in the
> >> time
> >>> window. I’ve done this with the current cooccurrence code but though
> >>> streaming, this is not incremental.
> >>>>>
> >>>>> The current data flow goes from interaction input to geometry and
> user
> >>> dictionary reconciliation to A’A, A’B etc. After the multiply the
> >> resulting
> >>> cooccurrence matrices are LLR weighted/filtered/down-sampled.
> >>>>>
> >>>>> Incremental can mean all sorts of things and may imply different
> >>> trade-offs. Did you have anything specific in mind?
> >>
> >>
>

Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
Andrew 

Take a look at the slides I posted.  In them I showed that the update does not grow beyond a very reasonable bound. 

Sent from my iPhone

> On Apr 18, 2015, at 9:15, Andrew Musselman <an...@gmail.com> wrote:
> 
> Yes that's what I mean; if the number of updates gets too big it probably
> would be unmanageable though.  This approach worked well with daily
> updates, but never tried it with anything "real time."
> 
>> On Saturday, April 18, 2015, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>> I think you are saying that instead of val newHashMap = lastHashMap ++
>> updateHashMap, layered updates might be useful since new and last are
>> potentially large. Some limit of updates might trigger a refresh. This
>> might work if the update works with incremental index updates in the search
>> engine. Given practical considerations the updates will be numerous and
>> nearly empty.
>> 
>> On Apr 17, 2015, at 7:58 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
>> 
>> I have not implemented it for recommendations but a layered cache/sieve
>> structure could be useful.
>> 
>> That is, between batch refreshes you can keep tacking on new updates in a
>> cascading order so values that are updated exist in the newest layer but
>> otherwise the lookup goes for the latest updated layer.
>> 
>> You can put a fractional multiplier on older layers for aging but again
>> I've not implemented it.
>> 
>> On Friday, April 17, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:
>> 
>>> 
>>> Yes. Also add the fact that the nano batches are bounded tightly in size
>>> both max and mean. And mostly filtered away anyway.
>>> 
>>> Aging is an open question. I have never seen any effect of alternative
>>> sampling so I would just assume "keep oldest" which just tosses more
>>> samples. Then occasionally rebuild from batch if you really want aging to
>>> go right.
>>> 
>>> Search updates any more are true realtime also so that works very well.
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Apr 17, 2015, at 17:20, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>> 
>>>> Thanks.
>>>> 
>>>> This idea is based on a micro-batch of interactions per update, not
>>> individual ones unless I missed something. That matches the typical input
>>> flow. Most interactions are filtered away by  frequency and number of
>>> interaction cuts.
>>>> 
>>>> A couple practical issues
>>>> 
>>>> In practice won’t this require aging of interactions too? So wouldn’t
>>> the update require some old interaction removal? I suppose this might
>> just
>>> take the form of added null interactions representing the geriatric ones?
>>> Haven’t gone through the math with enough detail to see if you’ve already
>>> accounted for this.
>>>> 
>>>> To use actual math (self-join, etc.) we still need to alter the geometry
>>> of the interactions to have the same row rank as the adjusted total. In
>>> other words the number of rows in all resulting interactions must be the
>>> same. Over time this means completely removing rows and columns or
>> allowing
>>> empty rows in potentially all input matrices.
>>>> 
>>>> Might not be too bad to accumulate gaps in rows and columns. Not sure if
>>> it would have a practical impact (to some large limit) as long as it was
>>> done, to keep the real size more or less fixed.
>>>> 
>>>> As to realtime, that would be under search engine control through
>>> incremental indexing and there are a couple ways to do that, not a
>> problem
>>> afaik. As you point out the query always works and is real time. The
>> index
>>> update must be frequent and not impact the engine's availability for
>>> queries.
>>>> 
>>>> On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> 
>>>> 
>>>> When I think of real-time adaptation of indicators, I think of this:
>> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
>>>> 
>>>> 
>>>>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>>> I’ve been thinking about Streaming (continuous input) and incremental
>>> cooccurrence.
>>>>> 
>>>>> As interactions stream in from the user it is fairly simple to use
>>> something like Spark streaming to maintain a moving time window for all
>>> input, and an update frequency that recalcs all input currently in the
>> time
>>> window. I’ve done this with the current cooccurrence code but though
>>> streaming, this is not incremental.
>>>>> 
>>>>> The current data flow goes from interaction input to geometry and user
>>> dictionary reconciliation to A’A, A’B etc. After the multiply the
>> resulting
>>> cooccurrence matrices are LLR weighted/filtered/down-sampled.
>>>>> 
>>>>> Incremental can mean all sorts of things and may imply different
>>> trade-offs. Did you have anything specific in mind?
>> 
>> 

Re: Streaming and incremental cooccurrence

Posted by Andrew Musselman <an...@gmail.com>.
Yes that's what I mean; if the number of updates gets too big it probably
would be unmanageable though.  This approach worked well with daily
updates, but never tried it with anything "real time."

On Saturday, April 18, 2015, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I think you are saying that instead of val newHashMap = lastHashMap ++
> updateHashMap, layered updates might be useful since new and last are
> potentially large. Some limit of updates might trigger a refresh. This
> might work if the update works with incremental index updates in the search
> engine. Given practical considerations the updates will be numerous and
> nearly empty.
>
> On Apr 17, 2015, at 7:58 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
>
> I have not implemented it for recommendations but a layered cache/sieve
> structure could be useful.
>
> That is, between batch refreshes you can keep tacking on new updates in a
> cascading order so values that are updated exist in the newest layer but
> otherwise the lookup goes for the latest updated layer.
>
> You can put a fractional multiplier on older layers for aging but again
> I've not implemented it.
>
> On Friday, April 17, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> >
> > Yes. Also add the fact that the nano batches are bounded tightly in size
> > both max and mean. And mostly filtered away anyway.
> >
> > Aging is an open question. I have never seen any effect of alternative
> > sampling so I would just assume "keep oldest" which just tosses more
> > samples. Then occasionally rebuild from batch if you really want aging to
> > go right.
> >
> > Search updates any more are true realtime also so that works very well.
> >
> > Sent from my iPhone
> >
> >> On Apr 17, 2015, at 17:20, Pat Ferrel <pat@occamsmachete.com> wrote:
> >>
> >> Thanks.
> >>
> >> This idea is based on a micro-batch of interactions per update, not
> > individual ones unless I missed something. That matches the typical input
> > flow. Most interactions are filtered away by  frequency and number of
> > interaction cuts.
> >>
> >> A couple practical issues
> >>
> >> In practice won’t this require aging of interactions too? So wouldn’t
> > the update require some old interaction removal? I suppose this might
> just
> > take the form of added null interactions representing the geriatric ones?
> > Haven’t gone through the math with enough detail to see if you’ve already
> > accounted for this.
> >>
> >> To use actual math (self-join, etc.) we still need to alter the geometry
> > of the interactions to have the same row rank as the adjusted total. In
> > other words the number of rows in all resulting interactions must be the
> > same. Over time this means completely removing rows and columns or
> allowing
> > empty rows in potentially all input matrices.
> >>
> >> Might not be too bad to accumulate gaps in rows and columns. Not sure if
> > it would have a practical impact (to some large limit) as long as it was
> > done, to keep the real size more or less fixed.
> >>
> >> As to realtime, that would be under search engine control through
> > incremental indexing and there are a couple ways to do that, not a
> problem
> > afaik. As you point out the query always works and is real time. The
> index
> > update must be frequent and not impact the engine's availability for
> > queries.
> >>
> >> On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>
> >>
> >> When I think of real-time adaptation of indicators, I think of this:
> >>
> >>
> >
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> >>
> >>
> >>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
> >>> I’ve been thinking about Streaming (continuous input) and incremental
> >>> cooccurrence.
> >>>
> >>> As interactions stream in from the user it is fairly simple to use
> >>> something like Spark streaming to maintain a moving time window for all
> >>> input, and an update frequency that recalcs all input currently in the
> >>> time window. I’ve done this with the current cooccurrence code but,
> >>> though it is streaming, it is not incremental.
> >>>
> >>> The current data flow goes from interaction input to geometry and user
> >>> dictionary reconciliation to A’A, A’B etc. After the multiply the
> >>> resulting cooccurrence matrices are LLR weighted/filtered/down-sampled.
> >>>
> >>> Incremental can mean all sorts of things and may imply different
> >>> trade-offs. Did you have anything specific in mind?
> >>
> >>
> >
>
>

Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I think you are saying that instead of val newHashMap = lastHashMap ++ updateHashMap, layered updates might be useful since new and last are potentially large. Some limit of updates might trigger a refresh. This might work if the update works with incremental index updates in the search engine. Given practical considerations the updates will be numerous and nearly empty.
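To make the contrast concrete, here is a minimal Scala sketch of the two update styles in question: the eager ++ merge, and stacking update layers until some limit triggers a refresh. All names are hypothetical and none of this is existing Mahout code.

import scala.collection.immutable.HashMap

object LayeredIndicatorUpdates {

  // item -> LLR weight for one indicator row
  type Row = HashMap[String, Double]

  // Style 1: eager merge. Copies the (potentially large) base map on every
  // micro-batch; the right-hand side wins for rows present in both.
  def eagerMerge(last: HashMap[String, Row], update: HashMap[String, Row]): HashMap[String, Row] =
    last ++ update

  // Style 2: keep update layers and only collapse them after some limit,
  // e.g. when it is time to push an incremental index update to the search engine.
  class LayeredRows(base: HashMap[String, Row], refreshLimit: Int = 100) {
    private var layers: List[HashMap[String, Row]] = Nil // newest first

    def applyUpdate(update: HashMap[String, Row]): Unit =
      layers = update :: layers

    // Lookup walks the layers newest to oldest, then falls back to the base map.
    def lookup(itemId: String): Option[Row] =
      layers.collectFirst { case l if l.contains(itemId) => l(itemId) }
        .orElse(base.get(itemId))

    // When enough nearly-empty updates have piled up, collapse them into a new
    // base (oldest first so newer values win) and reindex from that.
    def needsRefresh: Boolean = layers.size >= refreshLimit
    def refreshed(): HashMap[String, Row] = layers.reverse.foldLeft(base)(_ ++ _)
  }
}

The layered form only exists to avoid copying the big map on every nearly-empty update; the refresh (or a full batch rebuild) collapses everything back into one map before reindexing.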

On Apr 17, 2015, at 7:58 PM, Andrew Musselman <an...@gmail.com> wrote:

I have not implemented it for recommendations but a layered cache/sieve
structure could be useful.

That is, between batch refreshes you can keep tacking on new updates in a
cascading order so values that are updated exist in the newest layer but
otherwise the lookup goes for the latest updated layer.

You can put a fractional multiplier on older layers for aging but again
I've not implemented it.

On Friday, April 17, 2015, Ted Dunning <te...@gmail.com> wrote:

> 
> Yes. Also add the fact that the nano batches are bounded tightly in size
> both max and mean. And mostly filtered away anyway.
> 
> Aging is an open question. I have never seen any effect of alternative
> sampling so I would just assume "keep oldest" which just tosses more
> samples. Then occasionally rebuild from batch if you really want aging to
> go right.
> 
> Search updates these days are true realtime as well, so that works very well.
> 
> Sent from my iPhone
> 
>> On Apr 17, 2015, at 17:20, Pat Ferrel <pat@occamsmachete.com> wrote:
>> 
>> Thanks.
>> 
>> This idea is based on a micro-batch of interactions per update, not
> individual ones unless I missed something. That matches the typical input
> flow. Most interactions are filtered away by frequency and number of
> interaction cuts.
>> 
>> A couple practical issues
>> 
>> In practice won’t this require aging of interactions too? So wouldn’t
> the update require some old interaction removal? I suppose this might just
> take the form of added null interactions representing the geriatric ones?
> Haven’t gone through the math with enough detail to see if you’ve already
> accounted for this.
>> 
>> To use actual math (self-join, etc.) we still need to alter the geometry
> of the interactions to have the same row rank as the adjusted total. In
> other words the number of rows in all resulting interactions must be the
> same. Over time this means completely removing rows and columns or allowing
> empty rows in potentially all input matrices.
>> 
>> Might not be too bad to accumulate gaps in rows and columns. Not sure if
> it would have a practical impact (to some large limit) as long as it was
> done, to keep the real size more or less fixed.
>> 
>> As to realtime, that would be under search engine control through
> incremental indexing and there are a couple ways to do that, not a problem
> afaik. As you point out the query always works and is real time. The index
> update must be frequent and not impact the engine's availability for
> queries.
>> 
>> On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> 
>> 
>> When I think of real-time adaptation of indicators, I think of this:
>> 
>> 
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
>> 
>> 
>>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> I’ve been thinking about Streaming (continuous input) and incremental
>>> cooccurrence.
>>>
>>> As interactions stream in from the user it is fairly simple to use
>>> something like Spark streaming to maintain a moving time window for all
>>> input, and an update frequency that recalcs all input currently in the
>>> time window. I’ve done this with the current cooccurrence code but,
>>> though it is streaming, it is not incremental.
>>> 
>>> The current data flow goes from interaction input to geometry and user
> dictionary reconciliation to A’A, A’B etc. After the multiply the resulting
> cooccurrence matrices are LLR weighted/filtered/down-sampled.
>>> 
>>> Incremental can mean all sorts of things and may imply different
> trade-offs. Did you have anything specific in mind?
>> 
>> 
> 


Re: Streaming and incremental cooccurrence

Posted by Andrew Musselman <an...@gmail.com>.
I have not implemented it for recommendations but a layered cache/sieve
structure could be useful.

That is, between batch refreshes you can keep tacking on new updates in a
cascading order so values that are updated exist in the newest layer but
otherwise the lookup goes for the latest updated layer.

You can put a fractional multiplier on older layers for aging but again
I've not implemented it.
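In rough Scala terms the layered cache/sieve might look like the sketch below (hypothetical names, untested): the lookup takes the value from the newest layer that has the key, and older layers are discounted by a fractional multiplier.

import scala.collection.immutable.HashMap

class LayeredSieve(decay: Double = 0.9) {
  // layers(0) is the newest micro-batch of (item pair -> cooccurrence weight)
  private var layers: Vector[HashMap[(String, String), Double]] = Vector.empty

  def addLayer(update: HashMap[(String, String), Double]): Unit =
    layers = update +: layers

  // The value comes from the newest layer that contains the key; older layers
  // are discounted by decay^age so stale weights fade between batch refreshes.
  def lookup(pair: (String, String)): Option[Double] =
    layers.zipWithIndex.collectFirst {
      case (layer, age) if layer.contains(pair) => layer(pair) * math.pow(decay, age)
    }

  // A periodic batch refresh recomputes everything and clears the layers.
  def clearAfterBatchRefresh(): Unit = layers = Vector.empty
}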

On Friday, April 17, 2015, Ted Dunning <te...@gmail.com> wrote:

>
> Yes. Also add the fact that the nano batches are bounded tightly in size
> both max and mean. And mostly filtered away anyway.
>
> Aging is an open question. I have never seen any effect of alternative
> sampling so I would just assume "keep oldest" which just tosses more
> samples. Then occasionally rebuild from batch if you really want aging to
> go right.
>
> Search updates these days are true realtime as well, so that works very well.
>
> Sent from my iPhone
>
> > On Apr 17, 2015, at 17:20, Pat Ferrel <pat@occamsmachete.com> wrote:
> >
> > Thanks.
> >
> > This idea is based on a micro-batch of interactions per update, not
> individual ones unless I missed something. That matches the typical input
> flow. Most interactions are filtered away by frequency and number of
> interaction cuts.
> >
> > A couple practical issues
> >
> > In practice won’t this require aging of interactions too? So wouldn’t
> the update require some old interaction removal? I suppose this might just
> take the form of added null interactions representing the geriatric ones?
> Haven’t gone through the math with enough detail to see if you’ve already
> accounted for this.
> >
> > To use actual math (self-join, etc.) we still need to alter the geometry
> of the interactions to have the same row rank as the adjusted total. In
> other words the number of rows in all resulting interactions must be the
> same. Over time this means completely removing rows and columns or allowing
> empty rows in potentially all input matrices.
> >
> > Might not be too bad to accumulate gaps in rows and columns. Not sure if
> it would have a practical impact (to some large limit) as long as it was
> done, to keep the real size more or less fixed.
> >
> > As to realtime, that would be under search engine control through
> incremental indexing and there are a couple ways to do that, not a problem
> afaik. As you point out the query always works and is real time. The index
> update must be frequent and not impact the engine's availability for
> queries.
> >
> > On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> >
> > When I think of real-time adaptation of indicators, I think of this:
> >
> >
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> >
> >
> >> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
> >> I’ve been thinking about Streaming (continuous input) and incremental
> >> cooccurrence.
> >>
> >> As interactions stream in from the user it is fairly simple to use
> >> something like Spark streaming to maintain a moving time window for all
> >> input, and an update frequency that recalcs all input currently in the
> >> time window. I’ve done this with the current cooccurrence code but,
> >> though it is streaming, it is not incremental.
> >>
> >> The current data flow goes from interaction input to geometry and user
> dictionary reconciliation to A’A, A’B etc. After the multiply the resulting
> cooccurrence matrices are LLR weighted/filtered/down-sampled.
> >>
> >> Incremental can mean all sorts of things and may imply different
> trade-offs. Did you have anything specific in mind?
> >
> >
>

Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
Yes. Also add the fact that the nano batches are bounded tightly in size both max and mean. And mostly filtered away anyway. 

Aging is an open question. I have never seen any effect of alternative sampling so I would just assume "keep oldest" which just tosses more samples. Then occasionally rebuild from batch if you really want aging to go right.  
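In sketch form, "keep oldest" can be as simple as the following Scala (the cut value and names are illustrative only): once a user reaches the per-user interaction cut, later samples are dropped, and aging is left to the occasional full batch rebuild.

import scala.collection.mutable

object KeepOldestSampler {
  val maxPerUser = 500 // a typical per-user interaction cut; the value is an assumption

  private val counts = mutable.HashMap.empty[String, Int].withDefaultValue(0)

  // Returns true if this interaction should be kept for the incremental update.
  def accept(userId: String): Boolean = {
    if (counts(userId) >= maxPerUser) {
      false // user is already "full"; the new sample is simply tossed
    } else {
      counts(userId) += 1
      true
    }
  }
  // Aging is not handled here; per the thread, an occasional full batch
  // rebuild is the simple way to get aging right.
}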

Search updates these days are true realtime as well, so that works very well.

Sent from my iPhone

> On Apr 17, 2015, at 17:20, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> Thanks. 
> 
> This idea is based on a micro-batch of interactions per update, not individual ones unless I missed something. That matches the typical input flow. Most interactions are filtered away by frequency and number of interaction cuts.
> 
> A couple practical issues
> 
> In practice won’t this require aging of interactions too? So wouldn’t the update require some old interaction removal? I suppose this might just take the form of added null interactions representing the geriatric ones? Haven’t gone through the math with enough detail to see if you’ve already accounted for this.
> 
> To use actual math (self-join, etc.) we still need to alter the geometry of the interactions to have the same row rank as the adjusted total. In other words the number of rows in all resulting interactions must be the same. Over time this means completely removing rows and columns or allowing empty rows in potentially all input matrices.
> 
> Might not be too bad to accumulate gaps in rows and columns. Not sure if it would have a practical impact (to some large limit) as long as it was done, to keep the real size more or less fixed.
> 
> As to realtime, that would be under search engine control through incremental indexing and there are a couple ways to do that, not a problem afaik. As you point out the query always works and is real time. The index update must be frequent and not impact the engine's availability for queries.
> 
> On Apr 17, 2015, at 2:46 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> 
> When I think of real-time adaptation of indicators, I think of this:
> 
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> 
> 
>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> I’ve been thinking about Streaming (continuous input) and incremental cooccurrence.
>> 
>> As interactions stream in from the user it is fairly simple to use something like Spark streaming to maintain a moving time window for all input, and an update frequency that recalcs all input currently in the time window. I’ve done this with the current cooccurrence code but, though it is streaming, it is not incremental.
>> 
>> The current data flow goes from interaction input to geometry and user dictionary reconciliation to A’A, A’B etc. After the multiply the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled.
>> 
>> Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
> 
> 

Re: Streaming and incremental cooccurrence

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Thanks. 

This idea is based on a micro-batch of interactions per update, not individual ones unless I missed something. That matches the typical input flow. Most interactions are filtered away by frequency and number of interaction cuts.

A couple practical issues

In practice won’t this require aging of interactions too? So wouldn’t the update require some old interaction removal? I suppose this might just take the form of added null interactions representing the geriatric ones? Haven’t gone through the math with enough detail to see if you’ve already accounted for this.

To use actual math (self-join, etc.) we still need to alter the geometry of the interactions to have the same row rank as the adjusted total. In other words the number of rows in all resulting interactions must be the same. Over time this means completely removing rows and columns or allowing empty rows in potentially all input matrices.

Might not be too bad to accumulate gaps in rows and columns. Not sure if it would have a practical impact (to some large limit) as long as it was done, to keep the real size more or less fixed.
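One way to keep that geometry consistent is a single shared row (user) dictionary used by every interaction type, so all input matrices end up with the same row rank and a user missing from one type simply leaves an empty row. A minimal sketch with hypothetical names, not the existing Mahout dictionary code:

import scala.collection.mutable

class SharedRowDictionary {
  private val userToRow = mutable.HashMap.empty[String, Int]

  // Every interaction type resolves user ids through the same dictionary,
  // so a user seen only in B still reserves (an empty) row in A.
  def rowOf(userId: String): Int =
    userToRow.getOrElseUpdate(userId, userToRow.size)

  // All interaction matrices are built with this many rows; rows for users
  // with no interactions of a given type simply stay empty.
  def numRows: Int = userToRow.size
}

Columns (items) could be handled the same way per interaction type; removing geriatric rows then becomes a periodic compaction of the dictionary plus a batch rebuild.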

As to realtime, that would be under search engine control through incremental indexing and there are a couple ways to do that, not a problem afaik. As you point out the query always works and is real time. The index update must be frequent and not impact the engine's availability for queries.

On Apr 17, 2015, at 2:46 PM, Ted Dunning <te...@gmail.com> wrote:


When I think of real-time adaptation of indicators, I think of this:

http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime


On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
I’ve been thinking about Streaming (continuous input) and incremental cooccurrence.

As interactions stream in from the user it is fairly simple to use something like Spark streaming to maintain a moving time window for all input, and an update frequency that recalcs all input currently in the time window. I’ve done this with the current cooccurrence code but, though it is streaming, it is not incremental.

The current data flow goes from interaction input to geometry and user dictionary reconciliation to A’A, A’B etc. After the multiply the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled.

Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
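For reference, the windowed, non-incremental recalculation described above has roughly this shape with Spark Streaming; the input source, window sizes, and the recompute step are placeholders rather than working Mahout code.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowedRecalc {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-cooccurrence")
    val ssc = new StreamingContext(conf, Seconds(30)) // micro-batch interval

    // One "user,item" interaction per line, e.g. from a socket (or Kafka).
    val interactions = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(a => (a(0), a(1)))

    // Keep a moving time window of raw input and recalc the whole model on it.
    interactions
      .window(Minutes(60), Minutes(10)) // window length, recalc frequency
      .foreachRDD { rdd =>
        // Placeholder: run the existing batch cooccurrence job (A'A, A'B, LLR
        // weighting/down-sampling) over everything currently in the window,
        // then reindex the result in the search engine.
        println(s"recalc over ${rdd.count()} interactions in the window")
      }

    ssc.start()
    ssc.awaitTermination()
  }
}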



Re: Streaming and incremental cooccurrence

Posted by Ted Dunning <te...@gmail.com>.
When I think of real-time adaptation of indicators, I think of this:

http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime


On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I’ve been thinking about Streaming (continuous input) and incremental
> cooccurrence.
>
> As interactions stream in from the user it is fairly simple to use
> something like Spark streaming to maintain a moving time window for all
> input, and an update frequency that recalcs all input currently in the time
> window. I’ve done this with the current cooccurrence code but, though it is
> streaming, it is not incremental.
>
> The current data flow goes from interaction input to geometry and user
> dictionary reconciliation to A’A, A’B etc. After the multiply the resulting
> cooccurrence matrices are LLR weighted/filtered/down-sampled.
>
> Incremental can mean all sorts of things and may imply different
> trade-offs. Did you have anything specific in mind?