
Posted to user@predictionio.apache.org by Noelia Osés Fernández <no...@vicomtech.org> on 2017/11/16 15:26:37 UTC

Log-likelihood based correlation test?

Hi,

I've been trying to understand how the UR algorithm works and I think I
have a general idea. But I would like to have a *mathematical description*
of the step in which the LLR comes into play. In the CCO presentations I
have found it says:

(PtP) compares column to column using
*log-likelihood based correlation test*

However, I have searched for "log-likelihood based correlation test" in
google but no joy. All I get are explanations of the likelihood-ratio test
to compare two models.

I would very much appreciate a math explanation of log-likelihood based
correlation test. Any pointers to papers or any other literature that
explains this specifically are much appreciated.

Best regards,
Noelia

Re: Log-likelihood based correlation test?

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Yes, this will show the model. But if you do this a lot there are tools like Restlet that you plug in to Chrome. They will allow you to build queries of all sorts. For instance 
GET http://localhost:9200/urindex/_search?pretty 

will show the item rows of the UR model put into the index for the integration test data. The UI is a bit obtuse but you can scroll down in the right pane expanding bits of JSON as you go to see this:

"hits":{
  "total": 7,
  "max_score": 1,
  "hits":[
    {
      "_index": "urindex_1511033890025",
      "_type": "items",
      "_id": "Nexus",
      "_score": 1,
      "_source":{
        "defaultRank": 4,
        "expires": "2017-11-04T19:01:23.655-07:00",
        "countries":["United States", "Canada"],
        "id": "Nexus",
        "date": "2017-11-02T19:01:23.655-07:00",
        "category-pref":["tablets"],
        "categories":["Tablets", "Electronics", "Google"],
        "available": "2017-10-31T19:01:23.655-07:00",
        "purchase":[],
        "popRank": 2,
        "view":["Tablets"]
      }
    },

As you can see, no purchased items survived the correlation test, while one value each survived the view and category-pref correlation tests. The other fields are item properties set using $set events and are used with business rules.
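A small sketch of how one might read such a hit programmatically. The list of indicator field names is an assumption taken from the integration test document shown above, and `surviving_correlators` is a hypothetical helper, not part of PIO or the UR.

```python
import json

# Indicator (event) names as they appear in the sample document above.
# Assumption of this sketch: they are hard-coded, not read from the engine config.
INDICATOR_FIELDS = ("purchase", "view", "category-pref")


def surviving_correlators(hit):
    """Return {indicator name: values that survived the correlation test}."""
    source = hit.get("_source", {})
    return {name: list(source.get(name, [])) for name in INDICATOR_FIELDS}


# A trimmed copy of the "Nexus" hit from the index dump.
hit = json.loads("""
{
  "_id": "Nexus",
  "_source": {
    "purchase": [],
    "view": ["Tablets"],
    "category-pref": ["tablets"],
    "popRank": 2
  }
}
""")

for indicator, values in surviving_correlators(hit).items():
    print(f"{indicator}: {len(values)} surviving value(s) {values}")
```

Fields that are not indicators (popRank, categories, dates) are ignored here because they are item properties, not model output.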

 With something like this tool you can even take the query logged in the deployed PIO server and send it to see how the query is constructed and what the results are (same as you get from the SDK I’ll wager :-)



On Nov 20, 2017, at 7:07 AM, Daniel Gabrieli <dg...@salesforce.com> wrote:

There is a REST client for Elasticsearch and bindings in many popular languages, but to get started quickly I found these commands helpful:

List Indices:

curl -XGET 'localhost:9200/_cat/indices?v&pretty'

Get some documents from an index:

curl -XGET 'localhost:9200/<INDEX>/_search?q=*&pretty'

Then look at the "_source" in the document to see what values are associated with the document.
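The same lookup can be sketched in Python with only the standard library. The host and port are copied from the curl commands; `search_url` and `fetch_sources` are hypothetical helper names, and the response shape assumed is the usual "hits" / "hits" / "_source" nesting of a search response.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

ES_HOST = "http://localhost:9200"  # same host and port as the curl commands


def search_url(index, query="*"):
    """Build the _search URL for an index, mirroring the second curl command."""
    params = urlencode({"q": query, "pretty": "true"})
    return f"{ES_HOST}/{quote(index)}/_search?{params}"


def fetch_sources(index, query="*"):
    """Return the "_source" of every hit. Needs a running Elasticsearch."""
    with urlopen(search_url(index, query)) as response:
        body = json.load(response)
    return [hit["_source"] for hit in body["hits"]["hits"]]


print(search_url("urindex"))
```

Only the URL is built when the block runs; calling fetch_sources("urindex") performs the actual request and will fail unless Elasticsearch is listening on that port.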

More info here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#_source

This might also be helpful to work through a single specific query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html





On Mon, Nov 20, 2017 at 9:49 AM Noelia Osés Fernández <noses@vicomtech.org> wrote:
Thanks Daniel!

And excuse my ignorance but... how do you inspect the ES index?

On 20 November 2017 at 15:29, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
There is this CLI tool and article with more information that does produce scores:

https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html

But I don't know of any commands that return diagnostics about LLR from the PIO framework / UR engine.  That would be a nice feature if it doesn't exist.  The way I've gotten some insight into what the model is doing when using PIO / UR is by inspecting the Elasticsearch index that gets created, because it has the "significant" values populated in the documents (though not the actual LLR scores).

On Mon, Nov 20, 2017 at 7:22 AM Noelia Osés Fernández <noses@vicomtech.org> wrote:
This thread is very enlightening, thank you very much!

Is there a way I can see what the P, PtP, and PtL matrices of an app are? In the handmade case, for example?

Are there any pio calls I can use to get these?

On 17 November 2017 at 19:52, Pat Ferrel <pat@occamsmachete.com> wrote:
Mahout builds the model by doing matrix multiplication (PtP) then calculating the LLR score for every non-zero value. We then keep the top K or use a threshold to decide whether to keep or not (both are supported in the UR). LLR is a metric for seeing how likely it is that 2 events in a large group are correlated. Therefore LLR is only used to remove weak data from the model.
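The steps just described (cooccurrence counts, an LLR score per non-zero cell, then keep the top K) can be sketched in plain Python. This is an illustration under stated assumptions, not Mahout's code: `build_model` and the toy purchase data are hypothetical, the diagonal (an item paired with itself) is skipped by choice, and the LLR is computed in the entropy form of Dunning's statistic.

```python
from math import log


def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    """Unnormalized entropy: N ln N - sum(k ln k), with N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table of counts."""
    rows = entropy(k11 + k12, k21 + k22)
    cols = entropy(k11 + k21, k12 + k22)
    cells = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (rows + cols - cells))  # clamp rounding noise


def build_model(purchases, top_k=50):
    """purchases: {user: set of items}. Returns {item: [kept correlated items]}."""
    n_users = len(purchases)
    users_of = {}
    for user, items in purchases.items():
        for item in items:
            users_of.setdefault(item, set()).add(user)

    model = {}
    for a, users_a in users_of.items():
        scored = []
        for b, users_b in users_of.items():
            if a == b:
                continue  # sketch assumption: ignore self-cooccurrence
            k11 = len(users_a & users_b)  # the (a, b) cell of PtP
            if k11 == 0:
                continue  # zero cell: no evidence, nothing to score
            k12 = len(users_a) - k11      # bought a but not b
            k21 = len(users_b) - k11      # bought b but not a
            k22 = n_users - k11 - k12 - k21
            scored.append((llr(k11, k12, k21, k22), b))
        scored.sort(reverse=True)
        # Only the surviving item ids are kept, not the scores themselves.
        model[a] = [b for _, b in scored[:top_k]]
    return model


purchases = {
    "u1": {"iphone", "ipad"},
    "u2": {"iphone", "ipad"},
    "u3": {"iphone", "ipad", "nexus"},
    "u4": {"nexus", "galaxy"},
    "u5": {"nexus", "galaxy"},
    "u6": {"nexus", "galaxy"},
}
model = build_model(purchases, top_k=1)
print(model["iphone"], model["galaxy"])
```

With top_k=1 the single iphone/nexus cooccurrence is dropped while the consistent iphone/ipad pairing is kept, which is the filtering effect described above.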

So Mahout builds the model then it is put into Elasticsearch which is used as a KNN (K-nearest Neighbors) engine. The LLR score is not put into the model, only an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query and finding the items that most closely match it. Since PtP will have items in rows and the row will have correlating items, this “search” method works quite well to find items that had very similar items purchased with them as are in the user’s history.
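A toy version of that "search" step, as a sketch only. Elasticsearch ranks with its own Lucene relevance scoring; here the score is simply the number of correlators an item shares with the user's history, which is a deliberate simplification. The `recommend` function and the sample model are hypothetical, and dropping items already in the history is a choice made for this sketch.

```python
def recommend(model, history, num=3):
    """Rank items by how many of their correlators appear in the user's history.

    model:   {item: [correlated items that survived the LLR test]}
    history: items the user has interacted with (the "query")
    """
    history = set(history)
    scores = {}
    for item, correlators in model.items():
        if item in history:
            continue  # sketch assumption: do not recommend what the user already has
        overlap = len(history.intersection(correlators))
        if overlap:
            scores[item] = overlap
    # Highest overlap first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:num]


model = {
    "ipad": ["iphone", "macbook"],
    "airpods": ["iphone"],
    "nexus": ["galaxy", "pixel"],
    "galaxy": ["nexus", "pixel"],
}
print(recommend(model, ["iphone", "macbook"]))
```

Each model row plays the role of an indexed document and the history plays the role of the query terms, so no per-user model has to be precomputed.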

=============================== that is the simple explanation ========================================

Item-based recs take the model items (correlated items by the LLR test) as the query and the results are the most similar items, that is, the items with the most similar correlating items.

The model is items in rows and items in columns if you are only using one event: PtP. If you think it through, it has all purchased items as the row keys and, in each row, the other items purchased along with the row key. LLR filters out the weakly correlating non-zero values (0 means no evidence of correlation anyway). If we didn’t do this it would be purely a “Cooccurrence” recommender, one of the first useful ones. But filtering based on cooccurrence strength (PtP values without LLR applied to them) produces much worse results than using LLR to filter for the most highly correlated cooccurrences. You get a similar effect with Matrix Factorization but you can only use one type of event for various reasons.

Since LLR is a probabilistic metric that only looks at counts, it can be applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC (purchase, category-preferences). We did an experiment using Mean Average Precision for the UR, comparing video “Likes” alone against “Likes” plus “Dislikes” (so LtL vs. LtL and LtD), scraped from rottentomatoes.com reviews, and got a 20% lift in the MAP@k score by including data for “Dislikes”. https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

So the benefit and use of LLR is to filter weak data from the model and allow us to see if dislikes, and other events, correlate with likes. Adding this type of data, which is usually thrown away, is one of the most powerful reasons to use the algorithm. BTW the algorithm is called Correlated Cross-Occurrence (CCO).

The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN query is that it is fast, taking the user’s realtime events into the query, but also that it is trivial to add all sorts of business rules, like give me recs based on user events but only ones from a certain category, or give me recs but only ones tagged as “in-stock”. In fact the business rules can have inclusion rules, exclusion rules, and be mixed with ANDs and ORs.
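To make the business-rule idea concrete, here is a sketch of a query body in the Elasticsearch bool-query style, built as a Python dict. This is not the query the UR actually generates: `build_query` is a hypothetical helper, the field names "purchase" and "categories" are borrowed from the sample document earlier in the thread, and whether a term filter matches depends on how the field is mapped in the index.

```python
import json


def build_query(history, category=None, exclude_ids=()):
    """Combine a history match with optional inclusion and exclusion rules."""
    bool_query = {
        # Items whose purchase correlators overlap the user's history.
        "should": [{"terms": {"purchase": list(history)}}],
        "minimum_should_match": 1,
    }
    if category is not None:
        # Inclusion rule: only items in this category.
        bool_query["filter"] = [{"term": {"categories": category}}]
    if exclude_ids:
        # Exclusion rule: never return these item ids.
        bool_query["must_not"] = [{"ids": {"values": list(exclude_ids)}}]
    return {"query": {"bool": bool_query}}


query = build_query(["iPad", "Galaxy"], category="Tablets", exclude_ids=["Nexus"])
print(json.dumps(query, indent=2))
```

The history terms only influence ranking, while filter and must_not decide which items are eligible at all, which is why rules can be layered on without retraining the model.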

BTW there is a version ready for testing with PIO 0.12.0 and ES5 here: https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT Instructions are in the readme; notice it is in the 0.7.0-SNAPSHOT branch.


On Nov 17, 2017, at 7:59 AM, Andrew Troemner <atroemner@salesforce.com> wrote:

I'll echo Dan here. He and I went through the raw Mahout libraries called by the Universal Recommender, and while Noelia's description is accurate for an intermediate step, the indexing via Elasticsearch generates some separate relevancy scores based on their Lucene indexing scheme. The raw LLR scores are used in building this process, but the final scores served up by the APIs should be post-processed, and cannot be used to reconstruct the raw LLRs (to my understanding).

There are also some additional steps including down-sampling, which scrubs out very rare combinations (which otherwise would have very high LLRs for a single observation), which partially corrects for the statistical problem of multiple detection. But the underlying logic is per Ted Dunning's research and summarized by Noelia, and is a solid way to approach interaction effects for tens of thousands of items and including secondary indicators (like demographics, or implicit preferences).

On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
Maybe someone can correct me if I am wrong but in the code I believe Elasticsearch is used instead of "resulting LLR is what goes into the AB element in matrix PtP or PtL."

By default the strongest 50 LLR scores get set as searchable values in Elasticsearch per item-event pair.

You can configure the thresholds for significance using the configuration parameters maxCorrelatorsPerItem or minLLR.  This configuration is important because at the default of 50 you may end up treating all "indicator values" as significant.  More info here: http://actionml.com/docs/ur_config
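The concern about the default can be seen in a short sketch. `keep_correlators` is a hypothetical function whose arguments are merely named after the two parameters; the scores are made-up illustrative numbers, and applying both limits together when both are given is an assumption of the sketch, not a statement about the UR.

```python
def keep_correlators(scored, max_correlators=50, min_llr=None):
    """scored: list of (item, llr_score). Returns the item ids that are kept."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if min_llr is not None:
        ranked = [(item, score) for item, score in ranked if score >= min_llr]
    return [item for item, _ in ranked[:max_correlators]]


# Illustrative scores only, not real model output.
scored = [("ipad", 8.3), ("case", 0.4), ("nexus", 3.8), ("cable", 0.1)]

print(keep_correlators(scored))                     # top 50 of 4: everything survives
print(keep_correlators(scored, min_llr=3.0))        # threshold drops the weak pairs
print(keep_correlators(scored, max_correlators=1))  # strict top K
```

When an item has fewer than 50 cooccurring items, a top-50 cut removes nothing, so a score threshold (or a smaller K) is what actually filters weak pairs on small datasets.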



On Fri, Nov 17, 2017 at 4:50 AM Noelia Osés Fernández <noses@vicomtech.org> wrote:

Let's see if I've understood how LLR is used in UR. Let P be the matrix for the primary conversion indicator (say purchases) and Pt its transpose.

Then, with a second matrix, which can be P again to make PtP or a matrix for a secondary indicator (say L for likes) to make PtL, we take a row from Pt (item A) and a column from the second matrix (either P or L, in this example) (item B) and we calculate the table that Ted Dunning explains on his webpage: the number of co-occurrences in which item A AND B have been purchased (or purchased AND liked), the number of times that item A OR B have been purchased (or purchased OR liked), and the number of times that neither item A nor B have been purchased (or purchased or liked). With these counts we calculate LLR following the formulas that Ted Dunning provides and the resulting LLR is what goes into the AB element in matrix PtP or PtL. Correct?
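For reference, the statistic can be written out for a single pair of items. Dunning's table has four cells rather than three, because the "A or B" count is split into "A without B" and "B without A". The cell names k11..k22 follow his blog post; reading them as counts of users is an assumption about how they map onto P and L.

```latex
\begin{array}{c|cc}
        & B      & \lnot B \\ \hline
A       & k_{11} & k_{12}  \\
\lnot A & k_{21} & k_{22}
\end{array}
\qquad
N = k_{11} + k_{12} + k_{21} + k_{22}

\mathrm{LLR} = G^2
  = 2 \sum_{i,j} k_{ij} \ln \frac{k_{ij}\, N}{r_i\, c_j},
\qquad
r_i = \sum_j k_{ij}, \quad c_j = \sum_i k_{ij}

% Equivalent entropy form, with \eta(x_1,\dots,x_n) = (\sum_m x_m) \ln (\sum_m x_m) - \sum_m x_m \ln x_m :
G^2 = 2 \left[ \eta(r_1, r_2) + \eta(c_1, c_2) - \eta(k_{11}, k_{12}, k_{21}, k_{22}) \right]
```

Terms with k_ij = 0 contribute zero (the convention 0 ln 0 = 0). The value is near zero when the observed counts match what independence of A and B would predict, and grows as the table departs from independence, which is why it can serve as a correlation test.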

Thank you!

On 16 November 2017 at 17:03, Noelia Osés Fernández <noses@vicomtech.org> wrote:
Wonderful! Thanks Daniel!

Suneel, I'm still new to the Apache ecosystem and so I know that Mahout is used but only vaguely... I still don't know the different parts well enough to have a good understanding of what each of them do (Spark, MLLib, PIO, Mahout,...)

Thank you both!

On 16 November 2017 at 16:59, Suneel Marthi <smarthi@apache.org> wrote:
Indeed so. Ted Dunning is an Apache Mahout PMC member and committer and the whole idea of Search-based Recommenders stems from his work and insights.  If u didn't know, the PIO UR uses Apache Mahout under the hood and hence u see the LLR.

On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
I am pretty sure the LLR stuff in UR is based off of this blog post and associated paper:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

Accurate Methods for the Statistics of Surprise and Coincidence
by Ted Dunning

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962



Re: Log-likelihood based correlation test?

Posted by Daniel Gabrieli <dg...@salesforce.com>.

There is a REST client for Elasticsearch and bindings in many popular
languages, but to get started quickly I found these commands helpful:

List Indices:

curl -XGET 'localhost:9200/_cat/indices?v&pretty'

Get some documents from an index:

curl -XGET 'localhost:9200/<INDEX>/_search?q=*&pretty'

Then look at the "_source" in the document to see what values are
associated with the document.

More info here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#_source

this might also be helpful to work through a single specific query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html






Re: Log-likelihood based correlation test?

Posted by Noelia Osés Fernández <no...@vicomtech.org>.

Thanks Daniel!

And excuse my ignorance but... how do you inspect the ES index?

On 20 November 2017 at 15:29, Daniel Gabrieli <dg...@salesforce.com>
wrote:

> There is this cli tool and article with more information that does produce
> scores:
>
> https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
>
> But I don't know of any commands that return diagnostics about LLR from
> the PIO framework / UR engine.  That would be a nice feature if it doesn't
> exist.  The way I've gotten some insight into what the model is doing is by
> when using PIO / UR is by inspecting the the ElasticSearch index that gets
> created because it has the "significant" values populated in the documents
> (though not the actual LLR scores).
>
> On Mon, Nov 20, 2017 at 7:22 AM Noelia Osés Fernández <no...@vicomtech.org>
> wrote:
>
>> This thread is very enlightening, thank you very much!
>>
>> Is there a way I can see what the P, PtP, and PtL matrices of an app are?
>> In the handmade case, for example?
>>
>> Are there any pio calls I can use to get these?
>>
>> On 17 November 2017 at 19:52, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>>> Mahout builds the model by doing matrix multiplication (PtP) then
>>> calculating the LLR score for every non-zero value. We then keep the top K
>>> or use a threshold to decide whether to keep of not (both are supported in
>>> the UR). LLR is a metric for seeing how likely 2 events in a large group
>>> are correlated. Therefore LLR is only used to remove weak data from the
>>> model.
>>>
>>> So Mahout builds the model then it is put into Elasticsearch which is
>>> used as a KNN (K-nearest Neighbors) engine. The LLR score is not put into
>>> the model only an indicator that the item survived the LLR test.
>>>
>>> The KNN is applied using the user’s history as the query and finding
>>> items the most closely match it. Since PtP will have items in rows and the
>>> row will have correlating items, this “search” methods work quite well to
>>> find items that had very similar items purchased with it as are in the
>>> user’s history.
>>>
>>> =============================== that is the simple explanation
>>> ========================================
>>>
>>> Item-based recs take the model items (correlated items by the LLR test)
>>> as the query and the results are the most similar items—the items with most
>>> similar correlating items.
>>>
>>> The model is items in rows and items in columns if you are only using
>>> one event: PtP. If you think it through, it has every purchased item as
>>> a row key, and the row holds the other items purchased along with it. LLR
>>> filters out the weakly correlating non-zero values (0 means no evidence of
>>> correlation anyway). If we didn’t do this it would be purely a
>>> “Cooccurrence” recommender, one of the first useful ones. But filtering
>>> based on cooccurrence strength (PtP values without LLR applied to them)
>>> produces much worse results than using LLR to filter for the most highly
>>> correlated cooccurrences. You get a similar effect with Matrix
>>> Factorization but you can only use one type of event for various reasons.
>>>
>>> Since LLR is a probabilistic metric that only looks at counts, it can be
>>> applied equally well to PtV (purchase, view), PtS (purchase, search terms),
>>> PtC (purchase, category-preferences). We did an experiment measuring Mean
>>> Average Precision for the UR on video data scraped from rottentomatoes.com
>>> reviews, comparing “Likes” alone against “Likes” plus “Dislikes” (so LtL
>>> vs. LtL and LtD), and got a 20% lift in the MAP@k score by including data
>>> for “Dislikes”.
>>> https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
>>>
>>> So the benefit and use of LLR is to filter weak data from the model and
>>> allow us to see if dislikes, and other events, correlate with likes. Adding
>>> this type of data, which is usually thrown away, is one of the most powerful
>>> reasons to use the algorithm. BTW the algorithm is called Correlated
>>> Cross-Occurrence (CCO).
>>>
>>> The benefit of using Lucene (at the heart of Elasticsearch) to do the
>>> KNN query is that it is fast, taking the user’s realtime events into the
>>> query, but also that it is trivial to add all sorts of business rules,
>>> like “give me recs based on user events but only ones from a certain
>>> category”, or “give me recs but only ones tagged as in-stock”. In fact the
>>> business rules can have inclusion rules, exclusion rules, and be mixed with
>>> ANDs and ORs.
>>>
>>> BTW there is a version ready for testing with PIO 0.12.0 and ES5 here:
>>> https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT
>>> Instructions are in the readme; note that it is in the 0.7.0-SNAPSHOT
>>> branch.
>>>
>>>
>>> On Nov 17, 2017, at 7:59 AM, Andrew Troemner <at...@salesforce.com>
>>> wrote:
>>>
>>> I'll echo Dan here. He and I went through the raw Mahout libraries
>>> called by the Universal Recommender, and while Noelia's description is
>>> accurate for an intermediate step, the indexing via ElasticSearch generates
>>> some separate relevancy scores based on their Lucene indexing scheme. The
>>> raw LLR scores are used in building this process, but the final scores
>>> served up by the API's should be post-processed, and cannot be used to
>>> reconstruct the raw LLR's (to my understanding).
>>>
>>> There are also some additional steps including down-sampling, which
>>> scrubs out very rare combinations (which otherwise would have very high
>>> LLR's for a single observation), which partially corrects for the
>>> statistical problem of multiple detection. But the underlying logic is per
>>> Ted Dunning's research and summarized by Noelia, and is a solid way to
>>> approach interaction effects for tens of thousands of items and including
>>> secondary indicators (like demographics, or implicit preferences).
>>>
>>>
>>> *ANDREW TROEMNER*Associate Principal Data Scientist | salesforce.com
>>>
>>> On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <dgabrieli@salesforce.com>
>>> wrote:
>>>
>>>> Maybe someone can correct me if I am wrong but in the code I believe
>>>> Elasticsearch is used instead of "resulting LLR is what goes into the
>>>> AB element in matrix PtP or PtL."
>>>>
>>>> By default the strongest 50 LLR scores get set as searchable values in
>>>> Elasticsearch per item-event pair.
>>>>
>>>> You can configure the thresholds for significance using the
>>>> configuration parameters maxCorrelatorsPerItem or minLLR.  This
>>>> configuration is important because at the default of 50 you may end up
>>>> treating all "indicator values" as significant.  More info here:
>>>> http://actionml.com/docs/ur_config
>>>>
>>>>
>>>>
>>>> On Fri, Nov 17, 2017 at 4:50 AM Noelia Osés Fernández <
>>>> noses@vicomtech.org> wrote:
>>>>
>>>>>
>>>>> Let's see if I've understood how LLR is used in UR. Let P be the
>>>>> matrix for the primary conversion indicator (say purchases) and Pt its
>>>>> transpose.
>>>>>
>>>>> Then, with a second matrix, which can be P again to make PtP or a
>>>>> matrix for a secondary indicator (say L for likes) to make PtL, we take a
>>>>> row from Pt (item A) and a column from the second matrix (either P or L, in
>>>>> this example) (item B) and we calculate the table that Ted Dunning explains
>>>>> on his webpage: the number of cooccurrences in which items A *AND* B have
>>>>> been purchased (or purchased AND liked), the number of times that item A
>>>>> *OR* B has been purchased (or purchased OR liked), and the number
>>>>> of times that *neither* item A nor B has been purchased (or
>>>>> purchased or liked). With these counts we calculate LLR following the
>>>>> formulas that Ted Dunning provides and the resulting LLR is what goes into
>>>>> the AB element in matrix PtP or PtL. Correct?
>>>>>
>>>>> Thank you!
>>>>>
>>>>> On 16 November 2017 at 17:03, Noelia Osés Fernández <
>>>>> noses@vicomtech.org> wrote:
>>>>>
>>>>>> Wonderful! Thanks Daniel!
>>>>>>
>>>>>> Suneel, I'm still new to the Apache ecosystem and so I know that
>>>>>> Mahout is used but only vaguely... I still don't know the different parts
>>>>>> well enough to have a good understanding of what each of them do (Spark,
>>>>>> MLLib, PIO, Mahout,...)
>>>>>>
>>>>>> Thank you both!
>>>>>>
>>>>>> On 16 November 2017 at 16:59, Suneel Marthi <sm...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Indeed so. Ted Dunning is an Apache Mahout PMC and committer and the
>>>>>>> whole idea of Search-based Recommenders stems from his work and insights.
>>>>>>> If u didn't know, the PIO UR uses Apache Mahout under the hood and hence u
>>>>>>> see the LLR.
>>>>>>>
>>>>>>> On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli
>>>>>>> <dgabrieli@salesforce.com> wrote:
>>>>>>>
>>>>>>>> I am pretty sure the LLR stuff in UR is based off of this blog post
>>>>>>>> and associated paper:
>>>>>>>>
>>>>>>>> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
>>>>>>>>
>>>>>>>> Accurate Methods for the Statistics of Surprise and Coincidence
>>>>>>>> by Ted Dunning
>>>>>>>>
>>>>>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 16, 2017 at 10:26 AM Noelia Osés Fernández <
>>>>>>>> noses@vicomtech.org> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've been trying to understand how the UR algorithm works and I
>>>>>>>>> think I have a general idea. But I would like to have a *mathematical
>>>>>>>>> description* of the step in which the LLR comes into play. In the
>>>>>>>>> CCO presentations I have found it says:
>>>>>>>>>
>>>>>>>>> (PtP) compares column to column using
>>>>>>>>> *log-likelihood based correlation test*
>>>>>>>>>
>>>>>>>>> However, I have searched for "log-likelihood based correlation
>>>>>>>>> test" in google but no joy. All I get are explanations of the
>>>>>>>>> likelihood-ratio test to compare two models.
>>>>>>>>>
>>>>>>>>> I would very much appreciate a math explanation of log-likelihood
>>>>>>>>> based correlation test. Any pointers to papers or any other literature that
>>>>>>>>> explains this specifically are much appreciated.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Noelia
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "actionml-user" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to actionml-user+unsubscribe@googlegroups.com.
>>> To post to this group, send email to actionml-user@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/actionml-user/CAA2BRS%2Boj%2BNYDmsNNd2mYM1ZC5CgWwC71W3%3DEhrO9qeOiKyWXA%40mail.gmail.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>


-- 
<http://www.vicomtech.org>

Noelia Osés Fernández, PhD
Senior Researcher |
Investigadora Senior

noses@vicomtech.org
+[34] 943 30 92 30
Data Intelligence for Energy and
Industrial Processes | Inteligencia
de Datos para Energía y Procesos
Industriales

<https://www.linkedin.com/company/vicomtech>
<https://www.youtube.com/user/VICOMTech>
<ht...@Vicomtech_IK4>

member of:  <http://www.graphicsmedia.net/>     <http://www.ik4.es>

Legal Notice - Privacy policy <http://www.vicomtech.org/en/proteccion-datos>

Re: Log-likelihood based correlation test?

Posted by Daniel Gabrieli <dg...@salesforce.com>.

There is this cli tool and article with more information that does produce
scores:

https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html

But I don't know of any commands that return diagnostics about LLR from the
PIO framework / UR engine.  That would be a nice feature if it doesn't
exist.  The way I've gotten some insight into what the model is doing when
using PIO / UR is by inspecting the Elasticsearch index that gets created,
because it has the "significant" values populated in the documents (though
not the actual LLR scores).
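
A minimal sketch of that kind of inspection, assuming the search response has already been fetched (for example from the `GET http://localhost:9200/urindex/_search?pretty` query mentioned elsewhere in this thread) and parsed as JSON. The helper name `surviving_correlators` is hypothetical, and the sample document is an abridged copy of the one posted in this thread. Which fields are correlator fields depends on the event names configured for the app, so they are passed in explicitly here.

```python
import json

# Abridged sample response, shaped like the hit posted in this thread.
SAMPLE_RESPONSE = json.loads("""
{"hits": {"total": 7, "hits": [
  {"_index": "urindex_1511033890025", "_type": "items", "_id": "Nexus",
   "_source": {"id": "Nexus", "defaultRank": 4, "popRank": 2,
               "categories": ["Tablets", "Electronics", "Google"],
               "purchase": [], "view": ["Tablets"],
               "category-pref": ["tablets"]}}
]}}
""")


def surviving_correlators(response, event_names):
    """Map item id -> {event name: values that survived the LLR test}.

    Hypothetical helper: it only reads the fields named in event_names and
    ignores item properties such as categories or popRank.
    """
    result = {}
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        result[hit["_id"]] = {name: source.get(name, []) for name in event_names}
    return result


if __name__ == "__main__":
    # Event names are app-specific; these are the ones in the sample document.
    events = ["purchase", "view", "category-pref"]
    print(surviving_correlators(SAMPLE_RESPONSE, events))
```

An empty list for an event means no value passed the test for that item; the LLR scores themselves are not stored in the document.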

Re: Log-likelihood based correlation test?

Posted by Noelia Osés Fernández <no...@vicomtech.org>.

This thread is very enlightening, thank you very much!

Is there a way I can see what the P, PtP, and PtL matrices of an app are?
In the handmade case, for example?

Are there any pio calls I can use to get these?
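
For a handmade dataset small enough to hold in memory, the raw matrices can be computed by hand outside of PIO. The following is only a sketch in plain Python with made-up users, items and events (all names are hypothetical). It shows the cooccurrence counts before any LLR filtering, not what the UR stores.

```python
def indicator_matrix(events, users, items):
    """Binary user-by-item matrix: 1 if the user triggered the event on the item."""
    seen = set(events)
    return [[1 if (u, i) in seen else 0 for i in items] for u in users]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


# Hypothetical handmade data: (user, item) pairs per event type.
users = ["u1", "u2", "u3"]
items = ["iPad", "Nexus", "Surface"]
purchases = [("u1", "iPad"), ("u1", "Nexus"), ("u2", "iPad"),
             ("u2", "Nexus"), ("u3", "Surface")]
likes = [("u1", "Surface"), ("u2", "iPad"), ("u3", "iPad")]

P = indicator_matrix(purchases, users, items)
L = indicator_matrix(likes, users, items)
PtP = matmul(transpose(P), P)  # item-by-item purchase cooccurrence counts
PtL = matmul(transpose(P), L)  # purchased item (rows) by liked item (columns)

if __name__ == "__main__":
    for name, m in (("P", P), ("PtP", PtP), ("PtL", PtL)):
        print(name, m)
```

Each entry of PtP counts the users who purchased both the row item and the column item; each entry of PtL counts the users who purchased the row item and liked the column item.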


Re: Log-likelihood based correlation test?

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Use the default. Tuning with a threshold is only for atypical data, and unless you have a harness for cross-validation you would not know whether you were making things worse or better. We have our own tools for this but have never had the need for threshold tuning.

Yes, llrDownsampled(PtP) is the “model”; each doc put into Elasticsearch is a sparse representation of a row from it, along with the rows from PtV, PtC, and so on. Each gets a “field” in the doc.
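
To make the "one doc per row, one field per indicator" layout concrete, here is a small sketch. It assumes the LLR filtering has already happened and that each filtered matrix is available as a mapping from row item to the list of items that were kept. The function name `build_docs` and the sample data are hypothetical, and this is not the UR's actual indexing code.

```python
def build_docs(filtered_models):
    """Combine filtered rows into one document per item.

    filtered_models: {event name: {row item: [correlated items kept by LLR]}}
    Returns {item id: doc}, where each doc has one field per event name.
    Only item ids are stored; no LLR scores are carried over.
    """
    docs = {}
    for event, rows in filtered_models.items():
        for item, correlators in rows.items():
            doc = docs.setdefault(item, {"id": item})
            doc[event] = list(correlators)
    # Give every doc every field, so missing evidence shows as an empty list.
    for doc in docs.values():
        for event in filtered_models:
            doc.setdefault(event, [])
    return docs


if __name__ == "__main__":
    # Hypothetical filtered rows of PtP ("purchase") and PtV ("view").
    filtered = {
        "purchase": {"Nexus": [], "iPad": ["Surface"]},
        "view": {"Nexus": ["Tablets"]},
    }
    for doc in build_docs(filtered).values():
        print(doc)
```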


On Nov 22, 2017, at 6:16 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thanks Pat!

How can I tune the threshold?

And when you say "compare to each item in the model", do you mean each row in PtP?

On 21 November 2017 at 19:56, Pat Ferrel <pat@occamsmachete.com> wrote:
No, the non-zero elements of PtP have LLR calculated. The highest scores in each row are kept, or the ones above some threshold; the rest are removed as “noise”. The survivors are put into the Elasticsearch model without scores.

Elasticsearch compares the user history to each item in the model to find the most similar ones (the KNN). This uses Okapi BM25 from Lucene, which has several benefits over pure cosines (it actually consists of adjustments to cosine), and we also use norms. With ES 5 we should see quality improvements due to this. https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html
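
As a rough illustration of why the returned scores are neither cooccurrence counts nor plain cosines, here is a minimal sketch of textbook BM25 in Python. It is a simplification, not Lucene's implementation: the parameters `k1=1.2` and `b=0.75` are the commonly quoted defaults, each model row is treated as a bag of item IDs, and the sample model and history are hypothetical.

```python
import math


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each doc (a list of terms) against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    # Document frequency: in how many docs each term appears.
    df = {}
    for terms in docs.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    scores = {}
    for doc_id, terms in docs.items():
        score = 0.0
        for term in set(query_terms):
            tf = terms.count(term)
            if tf == 0:
                continue
            # Rare terms weigh more than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long rows are penalized.
            norm = 1 - b + b * len(terms) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * norm)
        scores[doc_id] = score
    return scores


# Hypothetical model: item -> ids that survived the LLR test.
model = {
    "iPad": ["Nexus", "Galaxy"],
    "Nexus": ["iPad"],
    "Surface": ["Galaxy", "iPad", "Nexus"],
    "Kindle": ["Fire"],
}
history = ["iPad", "Galaxy"]  # the user's purchases, used as the query
scores = bm25_scores(history, model)
for item, score in sorted(scores.items(), key=lambda p: p[1], reverse=True):
    print(item, round(score, 3))
```

Because each matching term adds its own contribution, a score is not bounded by 1, and a row matching several history items can outrank a shorter row that matches only one.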



On Nov 21, 2017, at 1:28 AM, Noelia Osés Fernández <noses@vicomtech.org> wrote:

Pat,

If I understood your explanation correctly, you say that some elements of PtP are removed by the LLR (set to zero, to be precise). But the elements that survive are calculated by matrix multiplication. The final PtP is put into Elasticsearch and when we query for user recommendations ES uses KNN to find the items (the rows in PtP) that are most similar to the user's history.

If the non-zero elements of PtP have been calculated by straight matrix multiplication, and I'm assuming that the P matrix only has 0s and 1s to indicate which items have been purchased by which user, then the elements of PtP are either 0 or greater than or equal to 1. However, the scores I get are below 1.

So is the KNN using cosine similarity as the metric to find the closest neighbours? And is the result of this cosine similarity metric what is returned as a 'score'?

If it is, when it is greater than 1, is this because the different cosine similarities are added together i.e. PtP, PtL... ?

Thank you for all your valuable help!

On 17 November 2017 at 19:52, Pat Ferrel <pat@occamsmachete.com> wrote:
Mahout builds the model by doing matrix multiplication (PtP), then calculating the LLR score for every non-zero value. We then keep the top K or use a threshold to decide whether to keep a value or not (both are supported in the UR). LLR is a metric for how likely it is that two events in a large group are correlated. LLR is therefore used only to remove weak data from the model.
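
The steps in that paragraph can be sketched in plain Python. This is a toy version under stated assumptions, not Mahout's code: purchases are binary, the diagonal of PtP (an item with itself) is skipped, the entropy-based LLR follows the formulation in Ted Dunning's write-up, and the names `build_model` and `top_k` plus the sample data are hypothetical.

```python
import math
from itertools import combinations


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # guard against rounding


def build_model(purchases, top_k=50):
    """purchases: {user: set of items}. Returns {item: [correlated items]}."""
    n_users = len(purchases)
    item_count = {}
    cooc = {}  # non-zero off-diagonal cells of PtP
    for items in purchases.values():
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for a, b in combinations(sorted(items), 2):
            cooc[(a, b)] = cooc.get((a, b), 0) + 1
            cooc[(b, a)] = cooc.get((b, a), 0) + 1
    scored = {}
    for (a, b), k11 in cooc.items():
        k12 = item_count[a] - k11        # bought A but not B
        k21 = item_count[b] - k11        # bought B but not A
        k22 = n_users - k11 - k12 - k21  # bought neither
        scored.setdefault(a, []).append((llr(k11, k12, k21, k22), b))
    # Keep only the ids of the top_k strongest per row; scores are dropped.
    return {
        a: [b for _, b in sorted(pairs, reverse=True)[:top_k]]
        for a, pairs in scored.items()
    }


# Hypothetical purchase history.
purchases = {
    "u1": {"iPad", "Nexus"},
    "u2": {"iPad", "Nexus"},
    "u3": {"iPad", "Galaxy"},
    "u4": {"Galaxy"},
    "u5": {"Surface"},
}
model = build_model(purchases, top_k=1)
print(model)
```

With `top_k=1` the "iPad" row keeps only "Nexus": it cooccurs twice and every "Nexus" buyer also bought an "iPad", so its LLR is higher than that of the single "iPad"/"Galaxy" cooccurrence.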

So Mahout builds the model, which is then put into Elasticsearch, which is used as a KNN (K-nearest Neighbors) engine. The LLR score is not put into the model, only an indicator that the item survived the LLR test.

The KNN is applied by using the user’s history as the query and finding the items that most closely match it. Since PtP has items in rows and each row holds the correlating items, this “search” method works quite well for finding items whose co-purchased items are very similar to those in the user’s history.

=============================== that is the simple explanation ========================================

Item-based recs take the model items (correlated items by the LLR test) as the query and the results are the most similar items—the items with most similar correlating items.

The model has items in rows and items in columns if you are only using one event: PtP. If you think it through, each purchased item is a row key, and the row holds the other items purchased along with it. LLR filters out the weakly correlating non-zero values (0 means no evidence of correlation anyway). If we didn’t do this it would be purely a “Cooccurrence” recommender, one of the first useful ones. But filtering on cooccurrence strength (PtP values without LLR applied to them) produces much worse results than using LLR to keep the most highly correlated cooccurrences. You get a similar effect with Matrix Factorization, but you can only use one type of event for various reasons.

Since LLR is a probabilistic metric that only looks at counts, it can be applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC (purchase, category-preferences). We ran an experiment on the UR, measured with Mean Average Precision, using data scraped from rottentomatoes.com reviews. It compared video “Likes” alone against “Likes” plus “Dislikes” (so LtL vs. LtL and LtD) and got a 20% lift in the MAP@k score by including the “Dislikes” data. https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
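
For readers unfamiliar with the metric, here is a minimal sketch of MAP@k in Python. Definitions of average precision at k vary slightly between papers and tools; this one divides by min(k, number of relevant items) and assumes each recommendation list has no duplicates. The function names and sample data are hypothetical and are not taken from the experiment mentioned above.

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user: mean of precision@rank over the ranks that hit."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(k, len(relevant))
    return precision_sum / denom if denom else 0.0


def map_at_k(all_recommended, all_relevant, k):
    """MAP@k: AP@k averaged over every user that has held-out items."""
    users = list(all_relevant)
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(all_recommended.get(u, []), all_relevant[u], k)
        for u in users
    )
    return total / len(users)


# Hypothetical recommendations and held-out "truth" per user.
recs = {"u1": ["a", "b", "c"], "u2": ["x", "y"]}
truth = {"u1": {"a", "c"}, "u2": {"y"}}
print(map_at_k(recs, truth, k=3))
```

A "lift" such as the one quoted above is then the relative change in this number between two models evaluated on the same held-out data.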

So the benefit and use of LLR is to filter weak data from the model and to let us see whether dislikes, and other events, correlate with likes. Adding this type of data, which is usually thrown away, is one of the most powerful reasons to use the algorithm. BTW the algorithm is called Correlated Cross-Occurrence (CCO).

The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN query is that it is fast and takes the user’s realtime events into the query, but also that it is trivial to add all sorts of business rules. For example: give me recs based on user events but only ones from a certain category, or give me recs but only ones tagged as “in-stock”. In fact the business rules can have inclusion rules, exclusion rules, and be mixed with ANDs and ORs.
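
As an illustration of how business rules can ride along with the similarity query, here is a sketch that builds an Elasticsearch `bool` query body as a plain Python dict. This is not the query the UR actually generates. The helper `build_query`, the field names, and the sample values are assumptions; only the `bool` clauses (`should`, `filter`, `must_not`, `minimum_should_match`) and the `term`/`terms` queries are standard Elasticsearch query DSL.

```python
import json


def build_query(user_history, must_have=None, must_not_have=None, size=10):
    """Build an Elasticsearch bool query body as a plain dict.

    user_history: {field: [item ids]}, e.g. {"purchase": [...], "view": [...]}
    must_have / must_not_have: {field: [values]} acting as business rules.
    """
    # One scored `term` clause per history item, so matches add up.
    should = [
        {"term": {field: item}}
        for field, items in user_history.items()
        for item in items
    ]
    # Inclusion rules: unscored filters that every hit must satisfy.
    filters = [{"terms": {f: v}} for f, v in (must_have or {}).items()]
    # Exclusion rules: hits matching these are dropped.
    excludes = [{"terms": {f: v}} for f, v in (must_not_have or {}).items()]
    return {
        "size": size,
        "query": {
            "bool": {
                "should": should,
                "minimum_should_match": 1,
                "filter": filters,
                "must_not": excludes,
            }
        },
    }


# Hypothetical request: recs from the user's purchases, tablets only,
# excluding an item the user already owns.
query = build_query(
    {"purchase": ["iPad", "Nexus"]},
    must_have={"categories": ["Tablets"]},
    must_not_have={"id": ["iPad"]},
)
print(json.dumps(query, indent=2))
```

The `should` clauses carry the user's history and drive the ranking, while `filter` and `must_not` only narrow the candidate set, which is why rules can be added without retraining the model.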

BTW there is a version ready for testing with PIO 0.12.0 and ES5 here: https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT Instructions are in the readme; note that it is in the 0.7.0-SNAPSHOT branch.


On Nov 17, 2017, at 7:59 AM, Andrew Troemner <atroemner@salesforce.com> wrote:

I'll echo Dan here. He and I went through the raw Mahout libraries called by the Universal Recommender, and while Noelia's description is accurate for an intermediate step, the indexing via ElasticSearch generates some separate relevancy scores based on their Lucene indexing scheme. The raw LLR scores are used in building this process, but the final scores served up by the API's should be post-processed, and cannot be used to reconstruct the raw LLR's (to my understanding).

There are also some additional steps including down-sampling, which scrubs out very rare combinations (which otherwise would have very high LLR's for a single observation), which partially corrects for the statistical problem of multiple detection. But the underlying logic is per Ted Dunning's research and summarized by Noelia, and is a solid way to approach interaction effects for tens of thousands of items and including secondary indicators (like demographics, or implicit preferences).

ANDREW TROEMNER
Associate Principal Data Scientist | salesforce.com

On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
Maybe someone can correct me if I am wrong but in the code I believe Elasticsearch is used instead of "resulting LLR is what goes into the AB element in matrix PtP or PtL."

By default the strongest 50 LLR scores get set as searchable values in Elasticsearch per item-event pair.

You can configure the thresholds for significance using the configuration parameters maxCorrelatorsPerItem or minLLR. This configuration is important because at the default of 50 you may end up treating all "indicator values" as significant. More info here: http://actionml.com/docs/ur_config
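
A small Python sketch of the two kinds of cut-off may help here. How the UR combines them internally is not shown in this thread, so applying the threshold first and the cap second is an assumption, as are the function name `select_correlators` and the sample scores; only the parameter names mirror maxCorrelatorsPerItem and minLLR.

```python
def select_correlators(scored, max_correlators_per_item=50, min_llr=None):
    """Pick which correlators of one item survive.

    scored: list of (item_id, llr_score) pairs for a single model row.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if min_llr is not None:
        # Threshold: drop everything below the minimum LLR.
        ranked = [pair for pair in ranked if pair[1] >= min_llr]
    # Cap: keep at most the strongest N; the scores themselves are discarded.
    return [item for item, _ in ranked[:max_correlators_per_item]]


# Hypothetical LLR scores for one row.
row = [("a", 10.0), ("b", 0.1), ("c", 3.0)]
print(select_correlators(row))                              # cap of 50 only
print(select_correlators(row, min_llr=1.0))                 # threshold
print(select_correlators(row, max_correlators_per_item=1))  # tight cap
```

The first call shows the concern raised above: when a row has fewer candidates than the cap, everything is kept, including the very weak "b".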



On Fri, Nov 17, 2017 at 4:50 AM Noelia Osés Fernández <noses@vicomtech.org> wrote:

Let's see if I've understood how LLR is used in UR. Let P be the matrix for the primary conversion indicator (say purchases) and Pt its transpose.

Then, with a second matrix, which can be P again to make PtP or a matrix for a secondary indicator (say L for likes) to make PtL, we take a row from Pt (item A) and a column from the second matrix (either P or L, in this example) (item B) and we calculate the table that Ted Dunning explains on his webpage: the number of cooccurrences in which item A AND B have been purchased (or purchased AND liked), the number of times that item A OR B have been purchased (or purchased OR liked), and the number of times that neither item A nor B have been purchased (or purchased or liked). With these counts we calculate LLR following the formulas that Ted Dunning provides and the resulting LLR is what goes into the AB element in matrix PtP or PtL. Correct?
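
For reference, the statistic for a 2x2 table can be written as below. This is a sketch in notation chosen for this note, not symbols taken from the UR or Mahout code: k_11 counts users with both A and B, k_12 those with A but not B, k_21 those with B but not A, and k_22 those with neither, with the convention 0 ln 0 = 0. Note that the two off-diagonal cells are "one but not the other" rather than a single "A OR B" count.

```latex
\[
\mathrm{LLR} \;=\; 2 \sum_{i=1}^{2} \sum_{j=1}^{2} k_{ij}\,
  \ln\frac{k_{ij}\,N}{r_i\,c_j},
\qquad
r_i = \sum_{j} k_{ij}, \quad
c_j = \sum_{i} k_{ij}, \quad
N = \sum_{i,j} k_{ij}
\]

% Equivalent form using unnormalized entropies
% H(x_1,\dots,x_n) = \bigl(\textstyle\sum_m x_m\bigr)\ln\bigl(\textstyle\sum_m x_m\bigr) - \sum_m x_m \ln x_m :
\[
\mathrm{LLR} \;=\; 2\,\bigl[\,H(r_1, r_2) + H(c_1, c_2) - H(k_{11}, k_{12}, k_{21}, k_{22})\,\bigr]
\]
```

Expanding the logarithm in the first form gives N ln N plus the sum of k ln k over the cells, minus the sums of r ln r and c ln c over the row and column totals, which is exactly what the entropy form computes.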

Thank you!

On 16 November 2017 at 17:03, Noelia Osés Fernández <noses@vicomtech.org> wrote:
Wonderful! Thanks Daniel!

Suneel, I'm still new to the Apache ecosystem and so I know that Mahout is used but only vaguely... I still don't know the different parts well enough to have a good understanding of what each of them do (Spark, MLLib, PIO, Mahout,...)

Thank you both!

On 16 November 2017 at 16:59, Suneel Marthi <smarthi@apache.org> wrote:
Indeed so. Ted Dunning is an Apache Mahout PMC and committer and the whole idea of Search-based Recommenders stems from his work and insights.  If u didn't know, the PIO UR uses Apache Mahout under the hood and hence u see the LLR.

On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
I am pretty sure the LLR stuff in UR is based off of this blog post and associated paper:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

Accurate Methods for the Statistics of Surprise and Coincidence
by Ted Dunning

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962


On Thu, Nov 16, 2017 at 10:26 AM Noelia Osés Fernández <noses@vicomtech.org> wrote:
Hi,

I've been trying to understand how the UR algorithm works and I think I have a general idea. But I would like to have a mathematical description of the step in which the LLR comes into play. In the CCO presentations I have found it says:

(PtP) compares column to column using log-likelihood based correlation test


However, I have searched for "log-likelihood based correlation test" in google but no joy. All I get are explanations of the likelihood-ratio test to compare two models. 

I would very much appreciate a math explanation of log-likelihood based correlation test. Any pointers to papers or any other literature that explains this specifically are much appreciated.

Best regards,
Noelia













Re: Log-likelihood based correlation test?

Posted by Noelia Osés Fernández <no...@vicomtech.org>.

Thanks Pat!

How can I tune the threshold?

And when you say "compare to each item in the model", do you mean each row
in PtP?




Re: Log-likelihood based correlation test?

Posted by Pat Ferrel <pa...@occamsmachete.com>.

No, the non-zero elements of PtP have LLR calculated. The highest scores in each row are kept, or the ones above some threshold; the rest are removed as “noise”. The survivors are put into the Elasticsearch model without scores.

Elasticsearch compares the user history to each item in the model to find the most similar ones (the KNN). This uses Okapi BM25 from Lucene, which has several benefits over pure cosines (it actually consists of adjustments to cosine), and we also use norms. With ES 5 we should see quality improvements due to this. https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html



On Nov 21, 2017, at 1:28 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Pat,

If I understood your explanation correctly, you say that some elements of PtP are removed by the LLR (set to zero, to be precise). But the elements that survive are calculated by matrix multiplication. The final PtP is put into Elasticsearch and when we query for user recommendations ES uses KNN to find the items (the rows in PtP) that are most similar to the user's history.

If the non-zero elements of PtP have been calculated by straight matrix multiplication, and I'm assuming that the P matrix only has 0s and 1s to indicate which items have been purchased by which user, then the elements of PtP are either 0 or greater than or equal to 1. However, the scores I get are below 1.

So is the KNN using cosine similarity as the metric to find the closest neighbours? And is the result of this cosine similarity metric what is returned as a 'score'?

If it is, when it is greater than 1, is this because the different cosine similarities are added together i.e. PtP, PtL... ?

Thank you for all your valuable help!

On 17 November 2017 at 19:52, Pat Ferrel <pat@occamsmachete.com> wrote:
Mahout builds the model by doing matrix multiplication (PtP), then calculating the LLR score for every non-zero value. We then keep the top K or use a threshold to decide whether to keep a value or not (both are supported in the UR). LLR is a metric for how likely it is that two events in a large group are correlated. LLR is therefore used only to remove weak data from the model.

So Mahout builds the model, which is then put into Elasticsearch, which is used as a KNN (K-nearest Neighbors) engine. The LLR score is not put into the model, only an indicator that the item survived the LLR test.

The KNN is applied by using the user’s history as the query and finding the items that most closely match it. Since PtP has items in rows and each row holds the correlating items, this “search” method works quite well for finding items whose co-purchased items are very similar to those in the user’s history.

=============================== that is the simple explanation ========================================

Item-based recs take the model items (correlated items by the LLR test) as the query and the results are the most similar items—the items with most similar correlating items.

The model is items in rows and items in columns if you are only using one event. PtP. If you think it through, it is all purchased items in as the row key and other items purchased along with the row key. LLR filters out the weakly correlating non-zero values (0 mean no evidence of correlation anyway). If we didn’t do this it would be purely a “Cooccurrence” recommender, one of the first useful ones. But filtering based on cooccurrence strength (PtP values without LLR applied to them) produces much worse results than using LLR to filter for most highly correlated cooccurrences. You get a similar effect with Matrix Factorization but you can only use one type of event for various reasons.

Since LLR is a probabilistic metric that only looks at counts, it can be applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC (purchase, category-preferences). We did an experiment using Mean Average Precision for the UR, comparing video “Likes” alone against “Likes” plus “Dislikes” (so LtL vs. LtL and LtD), with data scraped from rottentomatoes.com reviews, and got a 20% lift in the MAP@k score by including data for “Dislikes”: https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

So the benefit and use of LLR is to filter weak data from the model and allow us to see if dislikes, and other events, correlate with likes. Adding this type of data, which is usually thrown away, is one of the most powerful reasons to use the algorithm. BTW the algorithm is called Correlated Cross-Occurrence (CCO).

The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN query is that it is fast, taking the user’s realtime events into the query, but also that it is trivial to add all sorts of business rules, like “give me recs based on user events but only ones from a certain category”, or “give me recs but only ones tagged as in-stock”. In fact the business rules can have inclusion rules, exclusion rules, and be mixed with ANDs and ORs.

BTW there is a version ready for testing with PIO 0.12.0 and ES5 here: https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT. Instructions are in the readme; note that it is in the 0.7.0-SNAPSHOT branch.
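As a rough illustration of how business rules can ride along with the history query, the sketch below builds an Elasticsearch bool query body as a Python dict. It is an assumption about what such a query could look like, not the query the UR actually generates: the helper name `build_rec_query` is hypothetical, the field names `purchase` and `categories` are taken from the index document shown earlier in the thread, and whether a `term` match works on `categories` depends on the index mapping.

```python
import json


def build_rec_query(history, category=None, exclude_ids=None, size=10):
    """Build a bool query: history items score, rules include or exclude."""
    bool_query = {
        # One clause per history item, so each matching correlator adds score.
        "should": [{"term": {"purchase": item}} for item in history],
        "minimum_should_match": 1,
    }
    if category is not None:
        # Inclusion rule: restricts results without affecting the score.
        bool_query["filter"] = [{"term": {"categories": category}}]
    if exclude_ids:
        # Exclusion rule: drop specific item ids from the results.
        bool_query["must_not"] = [{"ids": {"values": list(exclude_ids)}}]
    return {"size": size, "query": {"bool": bool_query}}


query = build_rec_query(["iPad", "Surface"], category="Tablets", exclude_ids=["iPad"])
print(json.dumps(query, indent=2))
```

The resulting body could be sent to the `_search` endpoint of the model index. Combining `should`, `filter`, and `must_not` clauses is what allows inclusion and exclusion rules to be mixed with ANDs and ORs.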


On Nov 17, 2017, at 7:59 AM, Andrew Troemner <atroemner@salesforce.com> wrote:

I'll echo Dan here. He and I went through the raw Mahout libraries called by the Universal Recommender, and while Noelia's description is accurate for an intermediate step, the indexing via Elasticsearch generates some separate relevancy scores based on its Lucene indexing scheme. The raw LLR scores are used in building this process, but the final scores served up by the APIs should be post-processed, and cannot be used to reconstruct the raw LLRs (to my understanding).

There are also some additional steps, including down-sampling, which scrubs out very rare combinations (which would otherwise have very high LLRs for a single observation) and partially corrects for the statistical problem of multiple detection. But the underlying logic is per Ted Dunning's research as summarized by Noelia, and is a solid way to approach interaction effects for tens of thousands of items, including secondary indicators (like demographics or implicit preferences).

On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
Maybe someone can correct me if I am wrong, but in the code I believe Elasticsearch is used instead of "resulting LLR is what goes into the AB element in matrix PtP or PtL."

By default the strongest 50 LLR scores get set as searchable values in Elasticsearch per item-event pair.

You can configure the thresholds for significance using the configuration parameters maxCorrelatorsPerItem or minLLR. This configuration is important because at the default of 50 you may end up treating all "indicator values" as significant. More info here: http://actionml.com/docs/ur_config



On Fri, Nov 17, 2017 at 4:50 AM Noelia Osés Fernández <noses@vicomtech.org> wrote:

Let's see if I've understood how LLR is used in UR. Let P be the matrix for the primary conversion indicator (say purchases) and Pt its transpose.

Then, with a second matrix, which can be P again to make PtP or a matrix for a secondary indicator (say L for likes) to make PtL, we take a row from Pt (item A) and a column from the second matrix (either P or L, in this example) (item B) and we calculate the table that Ted Dunning explains on his webpage: the number of cooccurrences in which items A AND B have been purchased (or purchased AND liked), the number of times that item A OR B has been purchased (or purchased OR liked), and the number of times that neither item A nor B has been purchased (or purchased or liked). With these counts we calculate LLR following the formulas that Ted Dunning provides, and the resulting LLR is what goes into the AB element in matrix PtP or PtL. Correct?
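For reference, the table in Ted Dunning's post is a 2x2 contingency table with four cells rather than three: k11 (users with both A and B), k12 (A without B), k21 (B without A) and k22 (neither). The sketch below computes the LLR from those four counts using the entropy form of the formula; the function names are my own, and clamping the result at zero to absorb floating-point round-off is an assumption of this sketch.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of event counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero so round-off cannot produce a tiny negative score.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


def contingency(users_a, users_b, n_users):
    """Build (k11, k12, k21, k22) from the sets of users who touched A and B."""
    k11 = len(users_a & users_b)
    k12 = len(users_a - users_b)
    k21 = len(users_b - users_a)
    k22 = n_users - len(users_a | users_b)
    return k11, k12, k21, k22


# Independent events score (close to) zero; perfectly aligned events score high.
print(llr(10, 10, 10, 10))
print(llr(*contingency({"u1", "u2"}, {"u1", "u2"}, 4)))
```

For PtL the same function applies: users_a would be the users who purchased A and users_b the users who liked B.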

Thank you!

On 16 November 2017 at 17:03, Noelia Osés Fernández <noses@vicomtech.org> wrote:
Wonderful! Thanks Daniel!

Suneel, I'm still new to the Apache ecosystem and so I know that Mahout is used but only vaguely... I still don't know the different parts well enough to have a good understanding of what each of them does (Spark, MLLib, PIO, Mahout,...)

Thank you both!

On 16 November 2017 at 16:59, Suneel Marthi <smarthi@apache.org> wrote:
Indeed so. Ted Dunning is an Apache Mahout PMC member and committer, and the whole idea of Search-based Recommenders stems from his work and insights. If you didn't know, the PIO UR uses Apache Mahout under the hood, and hence you see the LLR.

On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
I am pretty sure the LLR stuff in UR is based off of this blog post and associated paper:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

Accurate Methods for the Statistics of Surprise and Coincidence
by Ted Dunning

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962



Re: Log-likelihood based correlation test?

Posted by Noelia Osés Fernández <no...@vicomtech.org>.

Pat,

If I understood your explanation correctly, you say that some elements of
PtP are removed by the LLR (set to zero, to be precise), but the elements
that survive are calculated by matrix multiplication. The final PtP is put
into Elasticsearch and, when we query for user recommendations, ES uses KNN
to find the items (the rows in PtP) that are most similar to the user's
history.

If the non-zero elements of PtP have been calculated by straight matrix
multiplication, and I'm assuming that the P matrix only has 0s and 1s to
indicate which items have been purchased by which user, then the elements
of PtP are either 0 or greater than or equal to 1. However, the scores I
get are below 1.

So is the KNN using cosine similarity as a metric to calculate the closest
neighbours? And is the result of this cosine similarity metric what is
returned as a 'score'?

If it is, when it is greater than 1, is this because the different cosine
similarities are added together, i.e. PtP, PtL...?

Thank you for all your valuable help!




Re: Log-likelihood based correlation test?

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Mahout builds the model by doing matrix multiplication (PtP), then calculating the LLR score for every non-zero value. We then keep the top K or use a threshold to decide whether to keep a value or not (both are supported in the UR). LLR is a metric for seeing how likely it is that 2 events in a large group are correlated. Therefore LLR is only used to remove weak data from the model.
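The build step described above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not Mahout's implementation: it uses set intersections instead of a distributed matrix multiply (for a 0/1 matrix P the intersection size equals the PtP cell), it skips an item's pairing with itself, and the names `build_model`, `max_correlators` and `min_llr` are hypothetical stand-ins for the top-K and threshold options.

```python
import math
from collections import defaultdict


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    return max(0.0, 2.0 * (row + col - entropy(k11, k12, k21, k22)))


def build_model(purchases, max_correlators=50, min_llr=0.0):
    """purchases: dict of user -> set of items. Returns item -> kept correlators."""
    n_users = len(purchases)
    item_users = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            item_users[item].add(user)

    model = {}
    for a, users_a in item_users.items():
        scored = []
        for b, users_b in item_users.items():
            if a == b:
                continue  # assumption: ignore self-cooccurrence
            k11 = len(users_a & users_b)  # the PtP cell for (a, b)
            if k11 == 0:
                continue  # zero cells carry no evidence, so no LLR is computed
            k12 = len(users_a) - k11
            k21 = len(users_b) - k11
            k22 = n_users - k11 - k12 - k21
            score = llr(k11, k12, k21, k22)
            if score > min_llr:
                scored.append((score, b))
        # Keep only the strongest correlators; the scores themselves are dropped.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        model[a] = [b for _, b in scored[:max_correlators]]
    return model


purchases = {
    "u1": {"A", "B"},
    "u2": {"A", "B"},
    "u3": {"C", "D"},
    "u4": {"C", "D"},
}
print(build_model(purchases))
print(build_model(purchases, min_llr=10.0))
```

With this toy data the A/B pair has an LLR of 8 ln 2 (about 5.5), so it survives the default threshold and is filtered out when the threshold is raised to 10.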

So Mahout builds the model, which is then put into Elasticsearch, which is used as a KNN (k-nearest neighbors) engine. The LLR score is not put into the model, only an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query and finding the items that most closely match it. Since PtP has items in rows and each row holds that item's correlating items, this “search” method works quite well to find items that had very similar items purchased with them to those in the user’s history.

=============================== that is the simple explanation ========================================

Item-based recs take the model items (correlated items by the LLR test) as the query and the results are the most similar items—the items with most similar correlating items.

The model is items in rows and items in columns if you are only using one event (PtP). If you think it through, each purchased item is a row key, and the row holds the other items purchased along with it. LLR filters out the weakly correlating non-zero values (0 means no evidence of correlation anyway). If we didn’t do this it would be purely a “Cooccurrence” recommender, one of the first useful ones. But filtering based on cooccurrence strength (PtP values without LLR applied to them) produces much worse results than using LLR to filter for the most highly correlated cooccurrences. You get a similar effect with Matrix Factorization but you can only use one type of event for various reasons.

Since LLR is a probabilistic metric that only looks at counts, it can be applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC (purchase, category-preferences). We did an experiment using Mean Average Precision for the UR, comparing video “Likes” alone against “Likes” plus “Dislikes” (so LtL vs. LtL and LtD), with data scraped from rottentomatoes.com reviews, and got a 20% lift in the MAP@k score by including data for “Dislikes”: https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

So the benefit and use of LLR is to filter weak data from the model and allow us to see if dislikes, and other events, correlate with likes. Adding this type of data, which is usually thrown away, is one of the most powerful reasons to use the algorithm. BTW the algorithm is called Correlated Cross-Occurrence (CCO).

The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN query is that it is fast, taking the user’s realtime events into the query, but also that it is trivial to add all sorts of business rules, like “give me recs based on user events but only ones from a certain category”, or “give me recs but only ones tagged as in-stock”. In fact the business rules can have inclusion rules, exclusion rules, and be mixed with ANDs and ORs.

BTW there is a version ready for testing with PIO 0.12.0 and ES5 here: https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT. Instructions are in the readme; note that it is in the 0.7.0-SNAPSHOT branch.



Re: Log-likelihood based correlation test?

Posted by Andrew Troemner <at...@salesforce.com>.

I'll echo Dan here. He and I went through the raw Mahout libraries called
by the Universal Recommender, and while Noelia's description is accurate
for an intermediate step, the indexing via Elasticsearch generates some
separate relevancy scores based on its Lucene indexing scheme. The raw
LLR scores are used in building this process, but the final scores served
up by the APIs should be post-processed, and cannot be used to reconstruct
the raw LLRs (to my understanding).

There are also some additional steps, including down-sampling, which scrubs
out very rare combinations (which would otherwise have very high LLRs for
a single observation) and partially corrects for the statistical problem
of multiple detection. But the underlying logic is per Ted Dunning's
research as summarized by Noelia, and is a solid way to approach
interaction effects for tens of thousands of items, including secondary
indicators (like demographics or implicit preferences).




Re: Log-likelihood based correlation test?

Posted by Daniel Gabrieli <dg...@salesforce.com>.

Someone correct me if I am wrong, but from the code I believe the
resulting LLR is not what "goes into the AB element in matrix PtP or PtL";
Elasticsearch is used instead.

By default, the correlators with the 50 strongest LLR scores are stored as
searchable values in Elasticsearch for each item-event pair.

You can configure the thresholds for significance with the configuration
parameters maxCorrelatorsPerItem or minLLR. This configuration is
important because, at the default of 50, you may end up treating all
"indicator values" as significant. More info here:
http://actionml.com/docs/ur_config
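
As a rough illustration of what these two settings do, here is a small
Python sketch. It assumes that, for a given item and event type, the
candidate correlators are ranked by LLR, anything below minLLR is dropped,
and at most maxCorrelatorsPerItem survive. The function name, the argument
names and the scores are made up for the example, and how UR combines the
two parameters internally is an assumption here, not something taken from
its code.

```python
def keep_correlators(llr_scores, max_correlators_per_item=50, min_llr=None):
    """Return correlator IDs ordered by descending LLR after thresholding.

    llr_scores: dict mapping a candidate correlator ID to its LLR score.
    Hypothetical helper for illustration; not part of UR or Mahout.
    """
    # Rank candidates from strongest to weakest LLR.
    ranked = sorted(llr_scores.items(), key=lambda kv: kv[1], reverse=True)
    # Optional absolute threshold on the score.
    if min_llr is not None:
        ranked = [(item, score) for item, score in ranked if score >= min_llr]
    # Cap on the number of correlators kept.
    return [item for item, _ in ranked[:max_correlators_per_item]]


# Invented scores, only to show the effect of each setting.
scores = {"Tablets": 12.4, "Phones": 3.1, "Cables": 0.2}
print(keep_correlators(scores))               # cap of 50 keeps all three
print(keep_correlators(scores, min_llr=1.0))  # "Cables" falls below minLLR
```

With only a handful of candidates, a cap of 50 keeps everything, which is
the situation the warning about the default refers to.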




Re: Log-likelihood based correlation test?

Posted by Noelia Osés Fernández <no...@vicomtech.org>.

Let's see if I've understood how LLR is used in UR. Let P be the matrix for
the primary conversion indicator (say purchases) and Pt its transpose.

Then, with a second matrix, which can be P again to make PtP or a matrix
for a secondary indicator (say L for likes) to make PtL, we take a row from
Pt (item A) and a column from the second matrix (either P or L, in this
example) (item B) and we calculate the table that Ted Dunning explains on
his webpage: the number of cooccurrences in which items A *AND* B have been
purchased (or purchased AND liked), the number of times that item A *OR* B
has been purchased (or purchased OR liked), and the number of times that
*neither* item A nor B has been purchased (or purchased or liked). With
these counts we calculate LLR following the formulas that Ted Dunning
provides, and the resulting LLR is what goes into the AB element in matrix
PtP or PtL. Correct?
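
To make the counts concrete, here is a minimal Python sketch of the LLR
computation for one (A, B) pair. It assumes the 2x2 table from Ted
Dunning's post has four cells: k11 (both A and B), k12 (A without B), k21
(B without A) and k22 (neither), so the "A OR B" count above would be split
into its two exclusive parts. The function names and the use of sets of
user IDs are illustrative assumptions, not the Mahout or UR code.

```python
import math


def x_log_x(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of raw counts: xlogx(sum) - sum of xlogx(k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def contingency(users_a, users_b, n_users):
    """Build (k11, k12, k21, k22) from two sets of user IDs."""
    both = len(users_a & users_b)
    return (both,
            len(users_a) - both,                # A without B
            len(users_b) - both,                # B without A
            n_users - len(users_a | users_b))   # neither


# Invented toy data: 10 users in total.
purchased_a = {"u1", "u2", "u3"}
purchased_b = {"u2", "u3", "u4"}
table = contingency(purchased_a, purchased_b, n_users=10)
print(table, llr(*table))
```

A table whose cells are exactly what independence would predict gives an
LLR of 0, and the score grows as the cooccurrence count departs from that.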

Thank you!


Re: Log-likelihood based correlation test?

Posted by Noelia Osés Fernández <no...@vicomtech.org>.

Wonderful! Thanks Daniel!

Suneel, I'm still new to the Apache ecosystem, so I know that Mahout is
used but only vaguely... I still don't know the different parts well enough
to have a good understanding of what each of them does (Spark, MLlib, PIO,
Mahout,...)

Thank you both!



-- 
Noelia Osés Fernández, PhD
Senior Researcher
Data Intelligence for Energy and Industrial Processes

noses@vicomtech.org
+[34] 943 30 92 30


Re: Log-likelihood based correlation test?

Posted by Suneel Marthi <sm...@apache.org>.

Indeed so. Ted Dunning is an Apache Mahout PMC member and committer, and
the whole idea of search-based recommenders stems from his work and
insights. In case you didn't know, the PIO UR uses Apache Mahout under the
hood, and hence you see the LLR.


Re: Log-likelihood based correlation test?

Posted by Daniel Gabrieli <dg...@salesforce.com>.

I am pretty sure the LLR stuff in UR is based on this blog post and the
associated paper:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

Accurate Methods for the Statistics of Surprise and Coincidence
by Ted Dunning

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
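
For quick reference before following the links, the statistic can be
written as below. This is a sketch of the G^2 form of the log-likelihood
ratio for a 2x2 table of counts, where k_11 counts users with both A and B,
k_12 those with A only, k_21 those with B only and k_22 those with neither.
The symbols k_ij, r_i, c_j and N are notation chosen here for illustration,
not names taken from UR.

```latex
\[
\mathrm{LLR} \;=\; 2 \sum_{i=1}^{2} \sum_{j=1}^{2}
  k_{ij} \,\ln\!\frac{k_{ij}\, N}{r_i\, c_j},
\qquad
r_i = \sum_{j} k_{ij}, \quad
c_j = \sum_{i} k_{ij}, \quad
N = \sum_{i,j} k_{ij}
\]
% Convention: a cell with k_{ij} = 0 contributes 0 to the sum.
```

The ratio inside the logarithm compares each observed count k_ij with the
count r_i c_j / N expected if A and B were independent, so the score is 0
when the table matches independence exactly.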

