Posted to user@predictionio.apache.org by Daniel Gabrieli <dg...@salesforce.com> on 2017/01/20 13:13:39 UTC

How to transform variables?

Hi,

I am new to PIO.

I have a variable called X that I would like to take the log of during
training and then during prediction as well.  Where is the appropriate
place to put the log function?

My guess is to override the "prepare" method; while I think the prepare
method is called just before training, I am not clear whether it is also
called before prediction.

Do I call the log transformation again somewhere else so that it occurs
during prediction?  Possibly in the predict method?

Thank you,





Re: How to transform variables?

Posted by Pat Ferrel <pa...@occamsmachete.com>.

I see PIO as a production big-data pipeline. It sounds like what you need is an interactive math framework where you can change the function and run cross-validation in nearly real time. That implies R, Python, or Scala + Mahout Samsara + Zeppelin. Of these, Mahout is the only interactive tool that runs on a Spark cluster backend, so it can crunch a lot of data from the interactive Scala shell. If you don’t need big data, the others might be more familiar. There are a lot of regression algorithms prepackaged in those tools, and some in PIO templates.

Then, once you have the algorithm designed, put its parameters in engine.json so you won’t have to change code to tune it, and move it into PIO for everyday production learning and prediction.
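
The idea of keeping tunable choices in engine.json rather than in code can be sketched roughly as below. This is plain Python, not PIO code: the JSON fragment only imitates the "algorithms"/"params" layout of an engine.json, and the "transform" parameter, the algorithm name "regression", and the TRANSFORMS table are all hypothetical names invented for illustration.

```python
import json
import math

# Hypothetical fragment in the style of engine.json. The "transform"
# parameter and the algorithm name are invented for this sketch.
ENGINE_JSON = """
{
  "algorithms": [
    {"name": "regression", "params": {"transform": "log"}}
  ]
}
"""

# Registry of candidate feature transformations, keyed by the name
# that appears in the configuration.
TRANSFORMS = {
    "identity": lambda x: x,
    "log": math.log,
    "square": lambda x: x * x,
}


def transform_from_config(text):
    """Look up the transformation named in the first algorithm's params."""
    params = json.loads(text)["algorithms"][0]["params"]
    return TRANSFORMS[params["transform"]]


f = transform_from_config(ENGINE_JSON)
print(f(1.0))
```

Switching from the log to another transformation is then a configuration edit rather than a code change, which is the point of putting the parameter in the file.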

On Jan 20, 2017, at 10:17 AM, Daniel Gabrieli <dg...@salesforce.com> wrote:

Thank you. That is helpful. More specifically, I am trying to implement a regression of a form like this:

write_score = B0 + B1*gender + B2*log(math) + B3*log(read)

Where a student's predicted writing score is a function of gender, the log of a math score, and the log of a reading score.

But in fact, what I am trying to understand is how to do feature engineering inside of PIO. I want to try various manipulations of the data to figure out what the best features are for a given model (log is a common example). I might want to try, for example, another regression like:

write_score = B0 + B2*(math - read)^2

Where the score on writing is a function of the squared difference between the math and reading scores.

I'd prefer to manipulate variables within the PIO Engine because the servers that send the event data to PIO are "just dumb pipes" and I'd like to keep the "data science" logic outside of those pipes and inside of PIO as much as possible.

On Fri, Jan 20, 2017 at 12:45 PM Pat Ferrel <pa...@occamsmachete.com> wrote:
It would help to know what you are trying to implement.

The datasource and preparator are used only during the input part of train; they pass data to the train method of your algorithm when you run `pio train`. The predict method does not use them at all. It may get data from the EventStore, but not through those other classes.

If you need data to always be the log of some number, you may want to take the log before it is sent to the EventServer so it will always be a log, even when you get it from the Query or out of the EventServer.


Re: How to transform variables?

Posted by Daniel Gabrieli <dg...@salesforce.com>.

Thank you. That is helpful.  More specifically, I am trying to implement a
regression of a form like this:

write_score = B0 + B1*gender + B2*log(math) + B3*log(read)

Where a student's predicted writing score is a function of gender, the log
of a math score, and the log of a reading score.

But in fact, what I am trying to understand is how to do feature
engineering inside of PIO.  I want to try various manipulations of the data
to figure out what the best features are for a given model (log is a common
example).  I might want to try, for example, another regression like:

write_score = B0 + B2*(math - read)^2

Where the score on writing is a function of the squared difference between
the math and reading scores.

I'd prefer to manipulate variables within the PIO Engine because the servers
that send the event data to PIO are "just dumb pipes" and I'd like to keep
the "data science" logic outside of those pipes and inside of PIO as much
as possible.
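
The two candidate regressions described above differ only in how a raw record is turned into a feature vector, so each one can be written as a small feature-builder function. The sketch below is plain Python, not PIO code; the record layout (keys "female", "math", "read"), the 0/1 encoding of gender, and all function names are assumptions made for illustration. The first builder includes a gender indicator because the prose lists gender as a predictor.

```python
import math

# Hypothetical record layout: "math" and "read" are positive test scores,
# "female" is a 0/1 indicator. None of these names come from PredictionIO.


def log_features(rec):
    """Features for write = B0 + B1*gender + B2*log(math) + B3*log(read)."""
    return [
        1.0,                    # intercept term (B0)
        float(rec["female"]),   # gender indicator (B1)
        math.log(rec["math"]),  # log of math score (B2)
        math.log(rec["read"]),  # log of reading score (B3)
    ]


def squared_diff_features(rec):
    """Features for write = B0 + B2*(math - read)^2."""
    return [1.0, float(rec["math"] - rec["read"]) ** 2]


def predict(coefs, features):
    """Linear prediction: dot product of coefficients and features."""
    return sum(b * x for b, x in zip(coefs, features))


student = {"female": 1, "math": 52, "read": 57}
print(log_features(student))
print(squared_diff_features(student))
```

With the transformation isolated in one function per model, trying a new feature set means swapping the builder while the fitting and prediction code stay the same.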

On Fri, Jan 20, 2017 at 12:45 PM Pat Ferrel <pa...@occamsmachete.com> wrote:

> It would help to know what you are trying to implement.
>
> The datasource and preparator are used only during the input part of
> train; they pass data to the train method of your algorithm when you run
> `pio train`. The predict method does not use them at all. It may get data
> from the EventStore, but not through those other classes.
>
> If you need data to always be the log of some number, you may want to take
> the log before it is sent to the EventServer so it will always be a log,
> even when you get it from the Query or out of the EventServer.

Re: How to transform variables?

Posted by Pat Ferrel <pa...@occamsmachete.com>.

It would help to know what you are trying to implement.

The datasource and preparator are used only during the input part of train; they pass data to the train method of your algorithm when you run `pio train`. The predict method does not use them at all. It may get data from the EventStore, but not through those other classes.

If you need data to always be the log of some number, you may want to take the log before it is sent to the EventServer so it will always be a log, even when you get it from the Query or out of the EventServer.
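
If the transformation stays inside the engine instead of moving upstream, the consequence of the point above is that the same function has to be applied in two places: once in the preparation step for training, and once explicitly in predict, because the preparation step never runs at query time. The sketch below illustrates that pattern in plain Python. The Preparator and Algorithm classes are simplified stand-ins that only mirror the roles described here, not PIO's actual classes or signatures, and the one-variable least-squares fit is just a placeholder model.

```python
import math


def transform(x):
    """The single place where the feature transformation is defined."""
    return math.log(x)


class Preparator:
    """Stand-in for the training-only preparation step."""

    def prepare(self, rows):
        # rows is a list of (x, y) pairs; only x is transformed.
        return [(transform(x), y) for x, y in rows]


class Algorithm:
    """Stand-in for an algorithm with train and predict methods."""

    def train(self, prepared):
        # Ordinary least squares for y = intercept + slope * z.
        n = len(prepared)
        mean_z = sum(z for z, _ in prepared) / n
        mean_y = sum(y for _, y in prepared) / n
        var_z = sum((z - mean_z) ** 2 for z, _ in prepared)
        cov = sum((z - mean_z) * (y - mean_y) for z, y in prepared)
        slope = cov / var_z
        return (mean_y - slope * mean_z, slope)

    def predict(self, model, query_x):
        intercept, slope = model
        # prepare() is not run at query time, so the raw query value
        # must go through the same transform here.
        return intercept + slope * transform(query_x)


rows = [(1.0, 2.0), (math.e, 5.0), (math.e ** 2, 8.0)]
algo = Algorithm()
model = algo.train(Preparator().prepare(rows))
print(algo.predict(model, math.e ** 3))
```

Keeping transform as one shared function avoids the failure mode where training sees log(X) but queries are scored on raw X.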

