Posted to user@predictionio.apache.org by Tahir Mushtaq <mu...@architonic.com> on 2016/09/26 10:44:21 UTC

Very few predictions

Hi everyone,

I am using the SimilarProducts template. I have around 3 million events
for about 180k unique items, collected over a period of 2 months.
The original event data is about 900MB, but after training the model data
shrinks to only 16KB. When I request predictions, I receive
predictions for only 15 items.

I have only $set and view events in my event store, which look like the
following.


{
  "event" : "$set",
  "entityType" : "item",
  "entityId" : "someEntityId",
  "properties" : {
    "property1" : "property1_value",
    "property2" : "property2_value"
  }
}


{
  "event" : "view",
  "entityType" : "user",
  "entityId" : "userSessionId",
  "targetEntityType" : "item",
  "targetEntityId" : "someTargetEntityId",
  "properties" : {}
}



A few facts about my implementation:
- I have removed the engine template's requirement to set a user before that
user can view an item, as described here
https://github.com/apache/incubator-predictionio/tree/develop/examples/scala-parallel-similarproduct/no-set-user

- Since I don't want to track users, I'm using the user's session id as the
entityId in the view event.
- In my case I cannot track whether an item is already set in the event store.
For this reason, I set the item before each view event, every time. I have
read many times in forums that multiple set events for an item do not affect
predictions.
- I'm using MySQL to store everything (event data, model data, metadata
etc.) because of certain requirements.

I have the following questions about the above problem:
1: Why is the model data so small, and why am I getting predictions for only a
couple of items?
2: Is this an event data quality problem? If yes, how can I test and improve
the data quality?
3: Is it safe to remove old duplicate set events with a MySQL query and leave
only the latest set event for each item? Will it help with data quality?
4: I see different settings for the ALS algorithm in the engine.json file. Can
tweaking those settings in some way help? Are those settings explained
somewhere?

Currently my ALS algorithm settings look like this:

"algorithms": [
  {
    "name": "als",
    "params": {
      "rank": 10,
      "numIterations": 10,
      "lambda": 0.01,
      "seed": 3
    }
  }
]


Many thanks for your time and suggestions.

Best,
Tahir

Re: Very few predictions

Posted by Kenneth Chan <ke...@apache.org>.
> 1: Why is the model data so small, and why am I getting predictions for only a
> couple of items?

The model size depends on the number of items. It only saves the "item-vector"
of each item.
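
For a rough sense of scale, here is a back-of-the-envelope sketch. It is not the template's actual serialization: it assumes each item vector is `rank` 8-byte doubles and ignores serialization overhead, ID maps and any item properties stored with the model. The function names are made up for illustration.

```python
def estimate_item_vector_bytes(num_items: int, rank: int, bytes_per_value: int = 8) -> int:
    """Raw size of num_items item vectors holding `rank` values each."""
    return num_items * rank * bytes_per_value


def max_items_for_size(model_bytes: int, rank: int, bytes_per_value: int = 8) -> int:
    """Upper bound on how many item vectors fit into model_bytes."""
    return model_bytes // (rank * bytes_per_value)


# If all 180k items had a vector at rank 10:
expected_bytes = estimate_item_vector_bytes(180_000, 10)  # 14,400,000 bytes (about 14.4MB)

# How many rank-10 vectors fit into a 16KB model at most:
upper_bound_items = max_items_for_size(16 * 1024, 10)     # 204
```

Under these assumptions, a 16KB model can hold at most a couple of hundred item vectors, so most of the 180k items probably never made it into the trained model.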

> 2: Is this an event data quality problem? If yes, how can I test and improve
> the data quality?

It could be because your data is too sparse. One way to investigate is to open
the engine page (http://localhost:8000 by default) after you 'pio deploy' the
engine and look at the printed model output.
You should see the info there. Take a look at the size of "productFeatures":
https://github.com/PredictionIO/template-scala-parallel-similarproduct/blob/develop/src/main/scala/ALSAlgorithm.scala#L30
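
Another way to check sparsity before training is to count how many sessions viewed more than one item. With session ids as users, many "users" may have only a single view, which probably leaves ALS with little to learn from. A minimal sketch, assuming the view events have been exported and loaded into a list of dicts with the same fields as the view event shown in the original post; `sparsity_report` and the sample data are made up for illustration.

```python
from collections import defaultdict


def sparsity_report(events):
    """Summarize how much co-view signal a list of events carries.

    Only "view" events are counted; "entityId" is the user (here: session id)
    and "targetEntityId" is the viewed item.
    """
    items_by_user = defaultdict(set)
    for ev in events:
        if ev.get("event") != "view":
            continue
        items_by_user[ev["entityId"]].add(ev["targetEntityId"])

    all_items = set()
    for items in items_by_user.values():
        all_items |= items

    num_users = len(items_by_user)
    num_items = len(all_items)
    num_pairs = sum(len(items) for items in items_by_user.values())
    multi_item_users = sum(1 for items in items_by_user.values() if len(items) >= 2)
    # Fraction of the user x item matrix that has at least one view
    density = num_pairs / (num_users * num_items) if num_users and num_items else 0.0

    return {
        "users": num_users,
        "items": num_items,
        "distinct_pairs": num_pairs,
        "users_with_2plus_items": multi_item_users,
        "density": density,
    }


# Tiny made-up sample: three sessions, only one of which viewed two items
sample = [
    {"event": "view", "entityId": "s1", "targetEntityId": "i1"},
    {"event": "view", "entityId": "s1", "targetEntityId": "i2"},
    {"event": "view", "entityId": "s2", "targetEntityId": "i1"},
    {"event": "view", "entityId": "s3", "targetEntityId": "i3"},
    {"event": "$set", "entityId": "i1"},
]
report = sparsity_report(sample)
```

If "users_with_2plus_items" is small compared to "users", most sessions contribute no information about which items are viewed together.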

> 3: Is it safe to remove old duplicate set events with a MySQL query and leave
> only the latest set event for each item? Will it help with data quality?

Yes, it's safe. (This template doesn't rely on state changes of item
properties to train the model.)
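
The selection logic can be sketched outside of MySQL like this. It is only an illustration and makes two assumptions: every event carries an ISO 8601 "eventTime" field, and each $set event carries the full set of properties for the item, so dropping older ones loses nothing. `parse_time` and `latest_set_events` are made-up helper names, not part of PredictionIO.

```python
from datetime import datetime


def parse_time(value: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def latest_set_events(events):
    """Keep only the newest $set event per (entityType, entityId).

    All other events are passed through unchanged, after the kept $set events.
    """
    latest = {}  # (entityType, entityId) -> newest $set event seen so far
    others = []
    for ev in events:
        if ev["event"] != "$set":
            others.append(ev)
            continue
        key = (ev["entityType"], ev["entityId"])
        kept = latest.get(key)
        if kept is None or parse_time(ev["eventTime"]) > parse_time(kept["eventTime"]):
            latest[key] = ev
    return list(latest.values()) + others


# Made-up sample: two $set events for the same item and one view event
events = [
    {"event": "$set", "entityType": "item", "entityId": "i1",
     "eventTime": "2016-09-01T10:00:00.000Z", "properties": {"p": "old"}},
    {"event": "$set", "entityType": "item", "entityId": "i1",
     "eventTime": "2016-09-20T10:00:00.000Z", "properties": {"p": "new"}},
    {"event": "view", "entityType": "user", "entityId": "s1",
     "targetEntityType": "item", "targetEntityId": "i1",
     "eventTime": "2016-09-20T10:00:01.000Z", "properties": {}},
]
cleaned = latest_set_events(events)
```

Removing duplicates this way shrinks the event store, but it does not add any view data, so on its own it is unlikely to change the predictions.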

> 4: I see different settings for the ALS algorithm in the engine.json file. Can
> tweaking those settings in some way help? Are those settings explained
> somewhere?

See here:
http://spark.apache.org/docs/1.6.2/mllib-collaborative-filtering.html#collaborative-filtering


