Posted to dev@predictionio.apache.org by "Pat Ferrel (JIRA)" <ji...@apache.org> on 2017/02/02 23:30:51 UTC

[jira] [Commented] (PIO-45) SelfCleaningDatasource erases all data

    [ https://issues.apache.org/jira/browse/PIO-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850734#comment-15850734 ] 

Pat Ferrel commented on PIO-45:
-------------------------------

[~emergentorder] This has gone through several fixes but, as of Feb 2 2017, it still does not work.

Attached is an input dataset; import it into an app with the supplied Python import script. The script writes events to the EventServer in reverse chronological order, so the first lines of the input file are the most recent events and the last lines are the oldest.
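To make the ordering concrete, here is a minimal sketch of how such an import could assign event times. It is not the attached script: the function name lines_to_events, the one-day spacing between lines, and the "item" entity type are assumptions for illustration, and it only handles the $set property lines of the form shown later in this comment.

```python
from datetime import datetime, timedelta, timezone


def lines_to_events(lines, newest, step=timedelta(days=1)):
    """Turn 'entity,$set,name:v1:v2' lines into event dicts.

    The first line gets the most recent eventTime; each following line
    is placed one `step` further in the past (the spacing is assumed).
    """
    events = []
    for i, line in enumerate(lines):
        entity_id, event, prop = line.strip().split(",")
        name, *values = prop.split(":")
        events.append({
            "event": event,
            "entityType": "item",  # assumed entity type
            "entityId": entity_id,
            "properties": {name: values},
            "eventTime": (newest - i * step).isoformat(),
        })
    return events


sample = [
    "Galaxy,$set,categories:Phones:Electronics:Samsung",
    "Galaxy,$set,categories:Phones:Electronics",
]
events = lines_to_events(sample, datetime(2017, 2, 2, tzinfo=timezone.utc))
```

With this ordering, the top line of the file is always the newest event for its entity, which is what the expected results below rely on.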

Run a template that has the SelfCleaningDataSource available, like the UR. The line enabling it is commented out; just remove the comment.

I then created an eventWindow in engine.json with the following:

"eventWindow": {
        "duration": "3650 days",
        "removeDuplicates": true,
        "compressProperties": false
}

This passed, so de-duplication seems to work.

Then I tried:

"eventWindow": {
        "duration": "3650 days",
        "removeDuplicates": true,
        "compressProperties": true
}

But property compression does not work. After compression, the value of each property should be the one from the newest $set of that property name. In fact, compression should never change the properties as they are returned when aggregated. But if you export the app data after running compression, a simple failure case is Galaxy, which has this input:

Galaxy,$set,categories:Phones:Electronics:Samsung
Galaxy,$set,categories:Phones:Electronics
Galaxy,$set,categories:Phones:Electronics:Samsung

Galaxy should obviously end up with: categories:Phones:Electronics:Samsung.
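The expected result can be illustrated with a small last-write-wins fold over the $set events. This is only a sketch of the semantics described above, not the SelfCleaningDataSource code; the function name aggregate_set_events and the integer eventTime values (3 = newest, matching the top line of the input) are assumptions for illustration.

```python
def aggregate_set_events(events):
    """Fold $set events per entity in eventTime order.

    Later events overwrite earlier values of the same property name,
    so the newest $set of each property wins.
    """
    state = {}
    for ev in sorted(events, key=lambda e: e["eventTime"]):
        if ev["event"] != "$set":
            continue
        state.setdefault(ev["entityId"], {}).update(ev["properties"])
    return state


# The three Galaxy lines from above; eventTime values are hypothetical,
# with the top line of the input file being the most recent.
galaxy_events = [
    {"event": "$set", "entityId": "Galaxy", "eventTime": 3,
     "properties": {"categories": ["Phones", "Electronics", "Samsung"]}},
    {"event": "$set", "entityId": "Galaxy", "eventTime": 2,
     "properties": {"categories": ["Phones", "Electronics"]}},
    {"event": "$set", "entityId": "Galaxy", "eventTime": 1,
     "properties": {"categories": ["Phones", "Electronics", "Samsung"]}},
]

aggregated = aggregate_set_events(galaxy_events)
```

Under these assumptions, aggregated["Galaxy"]["categories"] ends in "Samsung", and compressing the three events into one should leave that aggregated value unchanged.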

Without the SelfCleaningDataSource, I checked the model the UR creates, and this is the value written to it:

"categories": [
                  "Phones",
                  "Electronics",
                  "Samsung"
               ],

After enabling property compression by setting "compressProperties": true in engine.json, the data dumped from the app in the EventServer has:

"Galaxy","properties":"categories":["Phones","Electronics"]

There appear to be several other errors in property compression, but this should suffice as an illustration.

This seems pretty severe since the properties will never get back in sync.


> SelfCleaningDatasource erases all data
> --------------------------------------
>
>                 Key: PIO-45
>                 URL: https://issues.apache.org/jira/browse/PIO-45
>             Project: PredictionIO
>          Issue Type: Bug
>    Affects Versions: 0.10.0-incubating
>            Reporter: Pat Ferrel
>            Assignee: Alexander  Merritt
>            Priority: Critical
>             Fix For: 0.11.0
>
>
> as integrated into the UR, in the integration-test, the SelfCleaningDataset erases all data. This feature works fine in the AML version of PIO.
> Although not tested one could assume that this would be true with any other Datasource in other templates.
> [~emergentorder] can you check to see if the PIO merge was done correctly.


