Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/12/07 15:43:00 UTC

[jira] [Updated] (SPARK-33487) Let ML ALS recommend for BOTH subsets - users and items

     [ https://issues.apache.org/jira/browse/SPARK-33487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen updated SPARK-33487:
---------------------------------
    Priority: Minor  (was: Major)

> Let ML ALS recommend for BOTH subsets - users and items
> -------------------------------------------------------
>
>                 Key: SPARK-33487
>                 URL: https://issues.apache.org/jira/browse/SPARK-33487
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 3.0.1
>            Reporter: Rose Aysina
>            Priority: Minor
>
> Currently ALS in Spark ML supports the following methods for getting recommendations (a short usage sketch follows the list):
>  * {{recommendForAllUsers(numItems: Int): DataFrame}}
>  * {{recommendForAllItems(numUsers: Int): DataFrame}}
>  * {{recommendForUserSubset(dataset: Dataset[_], numItems: Int): DataFrame}}
>  * {{recommendForItemSubset(dataset: Dataset[_], numUsers: Int): DataFrame}}
>  
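> For illustration, here is a minimal sketch of how the existing subset methods are used; the {{ratings}} DataFrame and its column names ({{userId}}, {{itemId}}, {{rating}}) are just assumptions for the example, not part of this ticket:
> {code:scala}
> import org.apache.spark.ml.recommendation.{ALS, ALSModel}
>
> // Assumed input: a DataFrame `ratings` with columns "userId", "itemId", "rating".
> val als = new ALS()
>   .setUserCol("userId")
>   .setItemCol("itemId")
>   .setRatingCol("rating")
> val model: ALSModel = als.fit(ratings)
>
> // Restrict to a subset of users only: top 10 items for, say, the first 100 distinct users.
> val activeUsers = ratings.select("userId").distinct().limit(100)
> val userRecs = model.recommendForUserSubset(activeUsers, 10)
> {code}
>  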
> *Feature request:* add a method that recommends a subset of items for a subset of users, i.e. both the users and the items are selected from provided subsets.
> *Why it is important:* in real-time recommender systems you usually predict only for the currently active users (that is why we need a subset of users), and you cannot simply recommend every item you have, only those that satisfy some business filters (that is why we need a subset of items).
> *For example:* consider a real-time news recommender system. Prediction is done for a small subset of users (say, the visitors from the last minute), but it is not allowed to recommend old news, news unrelated to the user's country, etc., so at each prediction we have a whitelist of items.
> That is why it would be extremely useful to control *BOTH* which users to make recommendations for *AND* which items to include in these recommendations.
> *Related issues:* -SPARK-20679- , but that issue only covers subsets of either users *OR* items, not both.
> *What we do now:* we apply additional filtering after the {{recommendForUserSubset}} call (sketched below), but this approach has a significant cost: we must request recommendations over all items, i.e. *{{numItems = # all available items}}*, then filter, and only then select the top-k among them.
> *Why it is bad:* the subset of items allowed to be recommended at a given moment is usually much smaller than the number of all items seen in the original data (in my real dataset it is 220k vs. 500).
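> A minimal sketch of this workaround, reusing the {{model}} and {{activeUsers}} from the sketch above and assuming {{allowedItems}} is a DataFrame with a single {{itemId}} column (the whitelist):
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.{col, explode, row_number}
>
> // We have to ask for recommendations over ALL items, even though only a small whitelist is allowed.
> val totalItemCount = ratings.select("itemId").distinct().count().toInt
> val allRecs = model.recommendForUserSubset(activeUsers, totalItemCount)
>
> // Flatten the recommendations array, keep only whitelisted items, then re-select the top 10 per user.
> val filtered = allRecs
>   .select(col("userId"), explode(col("recommendations")).as("rec"))
>   .select(col("userId"), col("rec.itemId").as("itemId"), col("rec.rating").as("rating"))
>   .join(allowedItems, "itemId")
>   .withColumn("rank", row_number().over(Window.partitionBy("userId").orderBy(col("rating").desc)))
>   .filter(col("rank") <= 10)
>   .drop("rank")
> {code}
>  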
> *Design:* I am sorry, I am not familiar with Spark internals, so the solution below is based just on my human logic :)
> {code:scala}
> def recommendForUserItemSubsets(
>     userDataset: Dataset[_],
>     itemDataset: Dataset[_],
>     numItems: Int): DataFrame = {
>   // Restrict the factor matrices to the provided user and item subsets,
>   // then run the usual top-k recommendation over just those factors.
>   val userFactorSubset = getSourceFactorSubset(userDataset, userFactors, $(userCol))
>   val itemFactorSubset = getSourceFactorSubset(itemDataset, itemFactors, $(itemCol))
>   recommendForAll(userFactorSubset, itemFactorSubset, $(userCol), $(itemCol), numItems, $(blockSize))
> }
> {code}
>  
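> If such a method existed, calling it could look like the line below (the method itself is hypothetical and does not exist in the current API):
> {code:scala}
> // Hypothetical API: both the user side and the item side are restricted up front.
> val recs = model.recommendForUserItemSubsets(activeUsers, allowedItems, 10)
> {code}
>  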
> I would be glad to receive feedback on whether this is a reasonable request, and on possible more efficient workarounds.
>  
> Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org