Posted to dev@bluemarlin.apache.org by GitBox <gi...@apache.org> on 2022/01/20 17:00:36 UTC

[GitHub] [incubator-bluemarlin] jimmylao opened a new issue #31: Validate if the distribution of similarity scores is highly concentrated or well spread

jimmylao opened a new issue #31:
URL: https://github.com/apache/incubator-bluemarlin/issues/31


   process:
   
   1. Build the DIN model.
   2. Generate each user's profile from his/her keyword scores (interests), then compute the similarity score for every pair of users.
   3. Analyze the distribution of the resulting similarity scores to see whether they are concentrated in a narrow range or spread across the interval from 0 to 1 (cosine similarity).
   
   
   results:
   1. Here is an example of the keyword score profiles of the first 20 users.
   
   | user_id | kw1   | kw2   | kw3   | kw4   | kw5   | kw6   | kw7   | kw8   | kw9   | kw10  | kw11  | kw12  | kw13  | kw14  | kw15  |
   |---------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
   | 1       | 0.000 | 0.000 | 0.000 | 0.130 | 0.000 | 0.399 | 0.000 | 0.000 | 0.000 | 0.612 | 0.000 | 0.000 | 0.301 | 0.458 | 0.000 |
   | 5       | 0.000 | 0.000 | 0.078 | 0.000 | 0.000 | 0.416 | 0.000 | 0.000 | 0.366 | 0.436 | 0.384 | 0.000 | 0.189 | 0.000 | 0.541 |
   | 8       | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.563 | 0.000 | 0.000 | 0.649 | 0.678 | 0.000 | 0.000 | 0.000 | 0.600 | 0.000 |
   | 10      | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.279 | 0.000 | 0.125 | 0.000 | 0.223 | 0.000 |
   | 11      | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.354 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.162 | 0.275 | 0.000 | 0.000 |
   | 15      | 0.000 | 0.000 | 0.099 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.509 | 0.000 | 0.000 | 0.249 | 0.000 | 0.000 |
   | 22      | 0.000 | 0.000 | 0.152 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.515 | 0.000 | 0.000 | 0.000 | 0.423 | 0.000 |
   | 30      | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.474 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
   | 34      | 0.000 | 0.000 | 0.000 | 0.000 | 0.299 | 0.000 | 0.000 | 0.000 | 0.000 | 0.410 | 0.000 | 0.149 | 0.000 | 0.383 | 0.000 |
   | 35      | 0.000 | 0.000 | 0.145 | 0.000 | 0.000 | 0.646 | 0.000 | 0.000 | 0.311 | 0.000 | 0.000 | 0.000 | 0.000 | 0.440 | 0.000 |
   | 37      | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.423 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
   | 39      | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.496 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
   | 41      | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.327 | 0.250 | 0.000 | 0.000 | 0.000 | 0.307 | 0.000 | 0.000 | 0.382 | 0.000 |
   | 43      | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.349 | 0.000 | 0.000 | 0.000 | 0.430 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
   | 47      | 0.000 | 0.000 | 0.094 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.424 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
   | 49      | 0.305 | 0.509 | 0.000 | 0.000 | 0.000 | 0.721 | 0.000 | 0.000 | 0.000 | 0.758 | 0.000 | 0.000 | 0.000 | 0.740 | 0.000 |
   | 51      | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.415 | 0.000 | 0.000 | 0.128 | 0.000 | 0.000 | 0.000 |
   | 52      | 0.000 | 0.000 | 0.134 | 0.000 | 0.000 | 0.336 | 0.000 | 0.000 | 0.000 | 0.446 | 0.000 | 0.090 | 0.000 | 0.415 | 0.000 |
   | 53      | 0.106 | 0.000 | 0.000 | 0.000 | 0.406 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
   | 55      | 0.000 | 0.000 | 0.000 | 0.122 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.371 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
   
   2. Pairwise user similarity scores were computed from each user's keyword score profile. Here is an example of the pairwise similarity scores for the first 20 users above (user1 to user20 denote the rows of that table in order, not the user_id values). The similarity scores are well distributed between 0 and 1 rather than clustered at the low end (0) or the high end (1).
   
   |       | user1  | user2 | user3 | user4 | user5 | user6 | user7 | user8 | user9 | user10 | user11 | user12 | user13 | user14 | user15 | user16 | user17 | user18 | user19 | user20 |
   |--------|-------|-------|-------|-------|-------|-------|-------|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-------|
   | user1  | 1.000 | 0.536 | 0.794 | 0.782 | 0.510 | 0.729 | 0.807 | 0.663 | 0.708  | 0.583  | 0.000  | 0.000  | 0.517  | 0.788  | 0.648  | 0.837  | 0.000  | 0.906  | 0.000  | 0.674 |
   | user2  | 0.536 | 1.000 | 0.621 | 0.325 | 0.422 | 0.486 | 0.349 | 0.441 | 0.277  | 0.466  | 0.370  | 0.370  | 0.401  | 0.607  | 0.447  | 0.451  | 0.354  | 0.488  | 0.000  | 0.419 |
   | user3  | 0.794 | 0.621 | 1.000 | 0.684 | 0.335 | 0.481 | 0.707 | 0.543 | 0.623  | 0.779  | 0.520  | 0.520  | 0.517  | 0.706  | 0.530  | 0.774  | 0.497  | 0.831  | 0.000  | 0.516 |
   | user4  | 0.782 | 0.325 | 0.684 | 1.000 | 0.112 | 0.653 | 0.920 | 0.737 | 0.884  | 0.304  | 0.000  | 0.000  | 0.352  | 0.572  | 0.720  | 0.704  | 0.097  | 0.844  | 0.000  | 0.701 |
   | user5  | 0.510 | 0.422 | 0.335 | 0.112 | 1.000 | 0.250 | 0.000 | 0.000 | 0.078  | 0.562  | 0.000  | 0.000  | 0.379  | 0.468  | 0.000  | 0.379  | 0.100  | 0.393  | 0.000  | 0.000 |
   | user6  | 0.729 | 0.486 | 0.481 | 0.653 | 0.250 | 1.000 | 0.705 | 0.885 | 0.556  | 0.029  | 0.000  | 0.000  | 0.000  | 0.687  | 0.901  | 0.475  | 0.000  | 0.585  | 0.000  | 0.841 |
   | user7  | 0.807 | 0.349 | 0.707 | 0.920 | 0.000 | 0.705 | 1.000 | 0.753 | 0.836  | 0.357  | 0.000  | 0.000  | 0.369  | 0.585  | 0.784  | 0.729  | 0.000  | 0.872  | 0.000  | 0.716 |
   | user8  | 0.663 | 0.441 | 0.543 | 0.737 | 0.000 | 0.885 | 0.753 | 1.000 | 0.628  | 0.000  | 0.000  | 0.000  | 0.000  | 0.776  | 0.976  | 0.537  | 0.000  | 0.624  | 0.000  | 0.950 |
   | user9  | 0.708 | 0.277 | 0.623 | 0.884 | 0.078 | 0.556 | 0.836 | 0.628 | 1.000  | 0.302  | 0.000  | 0.000  | 0.350  | 0.488  | 0.613  | 0.644  | 0.067  | 0.761  | 0.443  | 0.597 |
   | user10 | 0.583 | 0.466 | 0.779 | 0.304 | 0.562 | 0.029 | 0.357 | 0.000 | 0.302  | 1.000  | 0.365  | 0.365  | 0.694  | 0.477  | 0.037  | 0.656  | 0.349  | 0.688  | 0.000  | 0.000 |
   | user11 | 0.000 | 0.370 | 0.520 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000  | 0.365  | 1.000  | 1.000  | 0.000  | 0.000  | 0.000  | 0.000  | 0.956  | 0.000  | 0.000  | 0.000 |
   | user12 | 0.000 | 0.370 | 0.520 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000  | 0.365  | 1.000  | 1.000  | 0.000  | 0.000  | 0.000  | 0.000  | 0.956  | 0.000  | 0.000  | 0.000 |
   | user13 | 0.517 | 0.401 | 0.517 | 0.352 | 0.379 | 0.000 | 0.369 | 0.000 | 0.350  | 0.694  | 0.000  | 0.000  | 1.000  | 0.322  | 0.000  | 0.573  | 0.000  | 0.587  | 0.000  | 0.000 |
   | user14 | 0.788 | 0.607 | 0.706 | 0.572 | 0.468 | 0.687 | 0.585 | 0.776 | 0.488  | 0.477  | 0.000  | 0.000  | 0.322  | 1.000  | 0.758  | 0.739  | 0.000  | 0.782  | 0.000  | 0.738 |
   | user15 | 0.648 | 0.447 | 0.530 | 0.720 | 0.000 | 0.901 | 0.784 | 0.976 | 0.613  | 0.037  | 0.000  | 0.000  | 0.000  | 0.758  | 1.000  | 0.524  | 0.000  | 0.650  | 0.000  | 0.928 |
   | user16 | 0.837 | 0.451 | 0.774 | 0.704 | 0.379 | 0.475 | 0.729 | 0.537 | 0.644  | 0.656  | 0.000  | 0.000  | 0.573  | 0.739  | 0.524  | 1.000  | 0.000  | 0.880  | 0.055  | 0.510 |
   | user17 | 0.000 | 0.354 | 0.497 | 0.097 | 0.100 | 0.000 | 0.000 | 0.000 | 0.067  | 0.349  | 0.956  | 0.956  | 0.000  | 0.000  | 0.000  | 0.000  | 1.000  | 0.037  | 0.000  | 0.000 |
   | user18 | 0.906 | 0.488 | 0.831 | 0.844 | 0.393 | 0.585 | 0.872 | 0.624 | 0.761  | 0.688  | 0.000  | 0.000  | 0.587  | 0.782  | 0.650  | 0.880  | 0.037  | 1.000  | 0.000  | 0.593 |
   | user19 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.443  | 0.000  | 0.000  | 0.000  | 0.000  | 0.000  | 0.000  | 0.055  | 0.000  | 0.000  | 1.000  | 0.000 |
   | user20 | 0.674 | 0.419 | 0.516 | 0.701 | 0.000 | 0.841 | 0.716 | 0.950 | 0.597  | 0.000  | 0.000  | 0.000  | 0.000  | 0.738  | 0.928  | 0.510  | 0.000  | 0.593  | 0.000  | 1.000 |
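
For readers who want to reproduce a few entries of the matrix above, here is a minimal sketch of pairwise cosine similarity with NumPy. It is not the project's actual code: the function name `pairwise_cosine` and the zero-norm guard are assumptions for illustration. The four rows are copied from the keyword score table (user_id 1, 30, 37 and 39, i.e. user1, user8, user11 and user12 in the matrix).

```python
import numpy as np

def pairwise_cosine(profiles: np.ndarray) -> np.ndarray:
    """Return the n x n cosine similarity matrix for the rows of `profiles`."""
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    # Assumption: an all-zero profile gets similarity 0 with everyone,
    # so replace a zero norm by 1 to avoid dividing by zero.
    safe_norms = np.where(norms == 0.0, 1.0, norms)
    unit = profiles / safe_norms
    return unit @ unit.T

# Keyword scores kw1..kw15 copied from the table above.
profiles = np.array([
    # user_id 1
    [0, 0, 0, 0.130, 0, 0.399, 0, 0, 0, 0.612, 0, 0, 0.301, 0.458, 0],
    # user_id 30 (only kw10 is non-zero)
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0.474, 0, 0, 0, 0, 0],
    # user_id 37 (only kw9 is non-zero)
    [0, 0, 0, 0, 0, 0, 0, 0, 0.423, 0, 0, 0, 0, 0, 0],
    # user_id 39 (only kw9 is non-zero)
    [0, 0, 0, 0, 0, 0, 0, 0, 0.496, 0, 0, 0, 0, 0, 0],
], dtype=float)

sim = pairwise_cosine(profiles)
print(np.round(sim, 3))
```

Two users whose only non-zero keyword is the same one (user_id 37 and 39) get a similarity of 1 regardless of the score magnitudes, which explains the 1.000 entry between user11 and user12.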
   
   3. Pairwise similarity scores were computed among the first 20k users, giving a 20,000 x 20,000 similarity matrix (cosine similarity was used). The distribution of the values in the matrix is shown below; it is very close to a normal distribution.
   ![image](https://user-images.githubusercontent.com/60371672/150385884-17b5b80a-6a12-4987-8072-8416fc799d34.png)
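
One way to put numbers on "concentrated versus spread" is sketched below. This is an illustration under stated assumptions, not the analysis that produced the plot: the helper `similarity_spread` is hypothetical, and the profiles are synthetic random data rather than the real keyword scores. It keeps each unordered pair once, drops the diagonal, and reports a histogram plus mean, standard deviation, quantiles, skewness and excess kurtosis (both near 0 for a normal distribution).

```python
import numpy as np

def similarity_spread(sim: np.ndarray, bins: int = 20) -> dict:
    """Summarize the off-diagonal values of a symmetric similarity matrix."""
    # Upper triangle without the diagonal: each user pair counted once.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    # Clip away floating-point overshoot so every value falls in [0, 1].
    values = np.clip(sim[rows, cols], 0.0, 1.0)
    counts, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    mean, std = float(values.mean()), float(values.std())
    # Standardized moments; left at 0 if all values are identical.
    z = (values - mean) / std if std > 0 else np.zeros_like(values)
    return {
        "n_pairs": int(values.size),
        "mean": mean,
        "std": std,
        "quantiles": np.quantile(values, [0.05, 0.25, 0.5, 0.75, 0.95]),
        "skewness": float(np.mean(z ** 3)),
        "excess_kurtosis": float(np.mean(z ** 4) - 3.0),
        "counts": counts,
        "edges": edges,
    }

# Synthetic stand-in data (assumption): 200 sparse, non-negative profiles
# over 15 keywords, roughly 30% of entries non-zero.
rng = np.random.default_rng(0)
profiles = rng.random((200, 15)) * (rng.random((200, 15)) < 0.3)
norms = np.linalg.norm(profiles, axis=1, keepdims=True)
unit = profiles / np.where(norms == 0.0, 1.0, norms)
sim = unit @ unit.T

stats = similarity_spread(sim)
print(stats["n_pairs"], round(stats["mean"], 3), round(stats["std"], 3))
```

A full 20,000 x 20,000 matrix in float64 holds 4 x 10^8 entries (about 3.2 GB), so at that scale it may be preferable to accumulate the histogram over row blocks instead of materializing the whole matrix.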
   

