Posted to user@spark.apache.org by Nirav Patel <np...@xactlycorp.com> on 2016/11/01 10:15:10 UTC

Spark ML - Is IDF model reusable

FYI, I do reuse the IDF model when making predictions against new unlabeled
data, but not between the training and test data while training a model.

On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <np...@xactlycorp.com> wrote:

> I am using the IDF estimator/model (TF-IDF) to convert text features into
> vectors. Currently, I fit the IDF model on all the sample data and then
> transform it. I read somewhere that I should split my data into training and
> test sets before fitting the IDF model: fit IDF only on training data and then
> use the same transformer to transform both the training and test data.
> This raises more questions:
> 1) Why would you do that? What exactly does IDF learn during the fitting
> process that it can reuse to transform any new dataset? Perhaps the idea is to
> keep the same values for |D| and DF|t, D| while using new TF|t, D|?
> 2) If not, then fitting and transforming seem redundant for the IDF model.
>
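
(For reference, Spark's IDF computes a smoothed inverse document frequency, so
the quantities mentioned above map onto:

    IDF(t, D) = log( (|D| + 1) / (DF(t, D) + 1) )
    TFIDF(t, d, D) = TF(t, d) * IDF(t, D)

|D| and DF(t, D) are fixed when the model is fit; only TF(t, d) comes from the
dataset being transformed.)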


Re: Spark ML - Is IDF model reusable

Posted by Nirav Patel <np...@xactlycorp.com>.
Cool!

So, going back to the IDF Estimator and Model question: do you know what an IDF
estimator really does during the Fitting process? It must be storing some state
(information), as I mentioned in the OP (|D|, DF|t, D| and perhaps TF|t, D|),
that it re-uses to Transform test data (labeled data). Or does it just
maintain a map (lookup) of tokens -> IDF scores and use that to look up
scores for test-data tokens?

Here's one possible thought in the context of Naive Bayes:
Fitting the IDF model (idf1) generates a conditional probability for each
token (feature). E.g. let's say the IDF of the term "software" is 4.5, so it
stores a lookup software -> 4.5.
Transforming training data using idf1 basically just creates a dataframe
with the above conditional probability vectors for each document.
Transforming test data using the same idf1 uses the lookup generated above to
create conditional probability vectors for each document. E.g. if it
encounters "software" in the test data, its IDF value would be just 4.5.
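
A minimal sketch of that with the spark.ml API (Spark 2.x; trainingDF/testDF and
the column names are placeholders for this example). The fitted IDFModel holds
one IDF weight per feature index, and transform just multiplies new TF vectors
by those stored weights:

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("tf")
    val idf       = new IDF().setInputCol("tf").setOutputCol("tfidf")

    // fit only computes corpus-level statistics (|D| and DF(t, D)), folded into IDF weights
    val trainTf  = hashingTF.transform(tokenizer.transform(trainingDF))
    val idfModel = idf.fit(trainTf)

    // the learned state: a vector of IDF weights, one per hashed feature index
    println(idfModel.idf)

    // transforming any dataset (training, test, unseen) reuses those same weights;
    // only the term frequencies TF(t, d) come from the data being transformed
    val trainVectors = idfModel.transform(trainTf)
    val testVectors  = idfModel.transform(hashingTF.transform(tokenizer.transform(testDF)))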

Thanks




On Tue, Nov 1, 2016 at 4:09 PM, ayan guha <gu...@gmail.com> wrote:

> Yes, that is correct. I think I misread a part of it in terms of
> scoring....I think we both are saying same thing so thats a good thing :)
>
> On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel <np...@xactlycorp.com>
> wrote:
>
>> Hi Ayan,
>>
>> "classification algorithm will for sure need to Fit against new dataset
>> to produce new model" I said this in context of re-training the model. Is
>> it not correct? Isn't it part of re-training?
>>
>> Thanks
>>
>> On Tue, Nov 1, 2016 at 4:01 PM, ayan guha <gu...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> "classification algorithm will for sure need to Fit against new dataset
>>> to produce new model" - I do not think this is correct. Maybe we are
>>> talking semantics but AFAIU, you "train" one model using some dataset, and
>>> then use it for scoring new datasets.
>>>
>>> You may re-train every month, yes. And you may run cross validation once
>>> a month (after re-training) or lower freq like once in 2-3 months to
>>> validate model quality. Here, number of months are not important, but you
>>> must be running cross validation and similar sort of "model evaluation"
>>> work flow typically in lower frequency than Re-Training process.
>>>
>>> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel <np...@xactlycorp.com>
>>> wrote:
>>>
>>>> Hi Ayan,
>>>> After deployment, we might re-train it every month. That is whole
>>>> different problem I have explored yet. classification algorithm will for
>>>> sure need to Fit against new dataset to produce new model. Correct me if I
>>>> am wrong but I think I will also FIt new IDF model based on new dataset. At
>>>> that time as well I will follow same training-validation split (or
>>>> corss-validation) to evaluate model performance on new data before
>>>> releasing it to make prediction. So afik , every time you  need to re-train
>>>> model you will need to corss validate using some data split strategy.
>>>>
>>>> I think spark ML document should start explaining mathematical model or
>>>> simple algorithm what Fit and Transform means for particular algorithm
>>>> (IDF, NaiveBayes)
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <gu...@gmail.com> wrote:
>>>>
>>>>> I have come across similar situation recently and decided to run
>>>>> Training  workflow less frequently than scoring workflow.
>>>>>
>>>>> In your use case I would imagine you will run IDF fit workflow once in
>>>>> say a week. It will produce a model object which will be saved. In scoring
>>>>> workflow, you will typically see new unseen dataset and the model generated
>>>>> in training flow will be used to score or label this new dataset.
>>>>>
>>>>> Note, train and test datasets are used during development phase when
>>>>> you are trying to find out which model to use and
>>>>> efficientcy/performance/accuracy etc. It will never be part of
>>>>> workflow. In a little elaborate setting you may want to automate model
>>>>> evaluations, but that's a different story.
>>>>>
>>>>> Not sure if I could explain properly, please feel free to comment.
>>>>> On 1 Nov 2016 22:54, "Nirav Patel" <np...@xactlycorp.com> wrote:
>>>>>
>>>>>> Yes, I do apply NaiveBayes after IDF .
>>>>>>
>>>>>> " you can re-train (fit) on all your data before applying it to
>>>>>> unseen data." Did you mean I can reuse that model to Transform both
>>>>>> training and test data?
>>>>>>
>>>>>> Here's the process:
>>>>>>
>>>>>> Datasets:
>>>>>>
>>>>>>    1. Full sample data (labeled)
>>>>>>    2. Training (labeled)
>>>>>>    3. Test (labeled)
>>>>>>    4. Unseen (non-labeled)
>>>>>>
>>>>>> Here are two workflow options I see:
>>>>>>
>>>>>> Option - 1 (currently using)
>>>>>>
>>>>>>    1. Fit IDF model (idf-1) on full Sample data
>>>>>>    2. Apply(Transform) idf-1 on full sample data
>>>>>>    3. Split data set into Training and Test data
>>>>>>    4. Fit ML model on Training data
>>>>>>    5. Apply(Transform) model on Test data
>>>>>>    6. Apply(Transform) idf-1 on Unseen data
>>>>>>    7. Apply(Transform) model on Unseen data
>>>>>>
>>>>>> Option - 2
>>>>>>
>>>>>>    1. Split sample data into Training and Test data
>>>>>>    2. Fit IDF model (idf-1) only on training data
>>>>>>    3. Apply(Transform) idf-1 on training data
>>>>>>    4. Apply(Transform) idf-1 on test data
>>>>>>    5. Fit ML model on Training data
>>>>>>    6. Apply(Transform) model on Test data
>>>>>>    7. Apply(Transform) idf-1 on Unseen data
>>>>>>    8. Apply(Transform) model on Unseen data
>>>>>>
>>>>>> So you are suggesting Option-2 in this particular case, right?
>>>>>>
>>>>>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <ro...@xense.co.uk>
>>>>>> wrote:
>>>>>>
>>>>>>> Fit it on training data to evaluate the model. You can either use
>>>>>>> that model to apply to unseen data or you can re-train (fit) on all your
>>>>>>> data before applying it to unseen data.
>>>>>>>
>>>>>>> fit and transform are 2 different things: fit creates a model,
>>>>>>> transform applies a model to data to create transformed output. If you are
>>>>>>> using your training data in a subsequent step (e.g. running logistic
>>>>>>> regression or some other machine learning algorithm) then you need to
>>>>>>> transform your training data using the IDF model before passing it through
>>>>>>> the next step.
>>>>>>>
>>>>>>> ------------------------------------------------------------
>>>>>>> -------------------
>>>>>>> Robin East
>>>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>>>> Manning Publications Co.
>>>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 1 Nov 2016, at 11:18, Nirav Patel <np...@xactlycorp.com> wrote:
>>>>>>>
>>>>>>> Just to re-iterate what you said, I should fit IDF model only on
>>>>>>> training data and then re-use it for both test data and then later on
>>>>>>> unseen data to make predictions.
>>>>>>>
>>>>>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <ro...@xense.co.uk>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The point of setting aside a portion of your data as a test set is
>>>>>>>> to try and mimic applying your model to unseen data. If you fit your IDF
>>>>>>>> model to all your data, any evaluation you perform on your test set is
>>>>>>>> likely to over perform compared to ‘real’ unseen data. Effectively you
>>>>>>>> would have overfit your model.
>>>>>>>> ------------------------------------------------------------
>>>>>>>> -------------------
>>>>>>>> Robin East
>>>>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>>>>> Manning Publications Co.
>>>>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1 Nov 2016, at 10:15, Nirav Patel <np...@xactlycorp.com> wrote:
>>>>>>>>
>>>>>>>> FYI, I do reuse IDF model while making prediction against new
>>>>>>>> unlabeled data but not between training and test data while training a
>>>>>>>> model.
>>>>>>>>
>>>>>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <np...@xactlycorp.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am using IDF estimator/model (TF-IDF) to convert text features
>>>>>>>>> into vectors. Currently, I fit IDF model on all sample data and then
>>>>>>>>> transform them. I read somewhere that I should split my data into training
>>>>>>>>> and test before fitting IDF model; Fit IDF only on training data and then
>>>>>>>>> use same transformer to transform training and test data.
>>>>>>>>> This raise more questions:
>>>>>>>>> 1) Why would you do that? What exactly do IDF learn during fitting
>>>>>>>>> process that it can reuse to transform any new dataset. Perhaps idea is to
>>>>>>>>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>>>>>>>>> 2) If not then fitting and transforming seems redundant for IDF
>>>>>>>>> model
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>>
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark ML - Is IDF model reusable

Posted by ayan guha <gu...@gmail.com>.
Yes, that is correct. I think I misread a part of it in terms of
scoring... I think we are both saying the same thing, so that's a good thing :)

On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel <np...@xactlycorp.com> wrote:

> Hi Ayan,
>
> "classification algorithm will for sure need to Fit against new dataset
> to produce new model" I said this in context of re-training the model. Is
> it not correct? Isn't it part of re-training?
>
> Thanks
>
> On Tue, Nov 1, 2016 at 4:01 PM, ayan guha <gu...@gmail.com> wrote:
>
>> Hi
>>
>> "classification algorithm will for sure need to Fit against new dataset
>> to produce new model" - I do not think this is correct. Maybe we are
>> talking semantics but AFAIU, you "train" one model using some dataset, and
>> then use it for scoring new datasets.
>>
>> You may re-train every month, yes. And you may run cross validation once
>> a month (after re-training) or lower freq like once in 2-3 months to
>> validate model quality. Here, number of months are not important, but you
>> must be running cross validation and similar sort of "model evaluation"
>> work flow typically in lower frequency than Re-Training process.
>>
>> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel <np...@xactlycorp.com>
>> wrote:
>>
>>> Hi Ayan,
>>> After deployment, we might re-train it every month. That is whole
>>> different problem I have explored yet. classification algorithm will for
>>> sure need to Fit against new dataset to produce new model. Correct me if I
>>> am wrong but I think I will also FIt new IDF model based on new dataset. At
>>> that time as well I will follow same training-validation split (or
>>> corss-validation) to evaluate model performance on new data before
>>> releasing it to make prediction. So afik , every time you  need to re-train
>>> model you will need to corss validate using some data split strategy.
>>>
>>> I think spark ML document should start explaining mathematical model or
>>> simple algorithm what Fit and Transform means for particular algorithm
>>> (IDF, NaiveBayes)
>>>
>>> Thanks
>>>
>>> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <gu...@gmail.com> wrote:
>>>
>>>> I have come across similar situation recently and decided to run
>>>> Training  workflow less frequently than scoring workflow.
>>>>
>>>> In your use case I would imagine you will run IDF fit workflow once in
>>>> say a week. It will produce a model object which will be saved. In scoring
>>>> workflow, you will typically see new unseen dataset and the model generated
>>>> in training flow will be used to score or label this new dataset.
>>>>
>>>> Note, train and test datasets are used during development phase when
>>>> you are trying to find out which model to use and
>>>> efficientcy/performance/accuracy etc. It will never be part of
>>>> workflow. In a little elaborate setting you may want to automate model
>>>> evaluations, but that's a different story.
>>>>
>>>> Not sure if I could explain properly, please feel free to comment.
>>>> On 1 Nov 2016 22:54, "Nirav Patel" <np...@xactlycorp.com> wrote:
>>>>
>>>>> Yes, I do apply NaiveBayes after IDF .
>>>>>
>>>>> " you can re-train (fit) on all your data before applying it to
>>>>> unseen data." Did you mean I can reuse that model to Transform both
>>>>> training and test data?
>>>>>
>>>>> Here's the process:
>>>>>
>>>>> Datasets:
>>>>>
>>>>>    1. Full sample data (labeled)
>>>>>    2. Training (labeled)
>>>>>    3. Test (labeled)
>>>>>    4. Unseen (non-labeled)
>>>>>
>>>>> Here are two workflow options I see:
>>>>>
>>>>> Option - 1 (currently using)
>>>>>
>>>>>    1. Fit IDF model (idf-1) on full Sample data
>>>>>    2. Apply(Transform) idf-1 on full sample data
>>>>>    3. Split data set into Training and Test data
>>>>>    4. Fit ML model on Training data
>>>>>    5. Apply(Transform) model on Test data
>>>>>    6. Apply(Transform) idf-1 on Unseen data
>>>>>    7. Apply(Transform) model on Unseen data
>>>>>
>>>>> Option - 2
>>>>>
>>>>>    1. Split sample data into Training and Test data
>>>>>    2. Fit IDF model (idf-1) only on training data
>>>>>    3. Apply(Transform) idf-1 on training data
>>>>>    4. Apply(Transform) idf-1 on test data
>>>>>    5. Fit ML model on Training data
>>>>>    6. Apply(Transform) model on Test data
>>>>>    7. Apply(Transform) idf-1 on Unseen data
>>>>>    8. Apply(Transform) model on Unseen data
>>>>>
>>>>> So you are suggesting Option-2 in this particular case, right?
>>>>>
>>>>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <ro...@xense.co.uk>
>>>>> wrote:
>>>>>
>>>>>> Fit it on training data to evaluate the model. You can either use
>>>>>> that model to apply to unseen data or you can re-train (fit) on all your
>>>>>> data before applying it to unseen data.
>>>>>>
>>>>>> fit and transform are 2 different things: fit creates a model,
>>>>>> transform applies a model to data to create transformed output. If you are
>>>>>> using your training data in a subsequent step (e.g. running logistic
>>>>>> regression or some other machine learning algorithm) then you need to
>>>>>> transform your training data using the IDF model before passing it through
>>>>>> the next step.
>>>>>>
>>>>>> ------------------------------------------------------------
>>>>>> -------------------
>>>>>> Robin East
>>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>>> Manning Publications Co.
>>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1 Nov 2016, at 11:18, Nirav Patel <np...@xactlycorp.com> wrote:
>>>>>>
>>>>>> Just to re-iterate what you said, I should fit IDF model only on
>>>>>> training data and then re-use it for both test data and then later on
>>>>>> unseen data to make predictions.
>>>>>>
>>>>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <ro...@xense.co.uk>
>>>>>> wrote:
>>>>>>
>>>>>>> The point of setting aside a portion of your data as a test set is
>>>>>>> to try and mimic applying your model to unseen data. If you fit your IDF
>>>>>>> model to all your data, any evaluation you perform on your test set is
>>>>>>> likely to over perform compared to ‘real’ unseen data. Effectively you
>>>>>>> would have overfit your model.
>>>>>>> ------------------------------------------------------------
>>>>>>> -------------------
>>>>>>> Robin East
>>>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>>>> Manning Publications Co.
>>>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 1 Nov 2016, at 10:15, Nirav Patel <np...@xactlycorp.com> wrote:
>>>>>>>
>>>>>>> FYI, I do reuse IDF model while making prediction against new
>>>>>>> unlabeled data but not between training and test data while training a
>>>>>>> model.
>>>>>>>
>>>>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <np...@xactlycorp.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am using IDF estimator/model (TF-IDF) to convert text features
>>>>>>>> into vectors. Currently, I fit IDF model on all sample data and then
>>>>>>>> transform them. I read somewhere that I should split my data into training
>>>>>>>> and test before fitting IDF model; Fit IDF only on training data and then
>>>>>>>> use same transformer to transform training and test data.
>>>>>>>> This raise more questions:
>>>>>>>> 1) Why would you do that? What exactly do IDF learn during fitting
>>>>>>>> process that it can reuse to transform any new dataset. Perhaps idea is to
>>>>>>>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>>>>>>>> 2) If not then fitting and transforming seems redundant for IDF
>>>>>>>> model
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
>
>



-- 
Best Regards,
Ayan Guha

Re: Spark ML - Is IDF model reusable

Posted by Nirav Patel <np...@xactlycorp.com>.
Hi Ayan,

"classification algorithm will for sure need to Fit against new dataset to
produce new model" I said this in context of re-training the model. Is it
not correct? Isn't it part of re-training?

Thanks

On Tue, Nov 1, 2016 at 4:01 PM, ayan guha <gu...@gmail.com> wrote:

> Hi
>
> "classification algorithm will for sure need to Fit against new dataset
> to produce new model" - I do not think this is correct. Maybe we are
> talking semantics but AFAIU, you "train" one model using some dataset, and
> then use it for scoring new datasets.
>
> You may re-train every month, yes. And you may run cross validation once a
> month (after re-training) or lower freq like once in 2-3 months to validate
> model quality. Here, number of months are not important, but you must be
> running cross validation and similar sort of "model evaluation" work flow
> typically in lower frequency than Re-Training process.
>
> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel <np...@xactlycorp.com> wrote:
>
>> Hi Ayan,
>> After deployment, we might re-train it every month. That is whole
>> different problem I have explored yet. classification algorithm will for
>> sure need to Fit against new dataset to produce new model. Correct me if I
>> am wrong but I think I will also FIt new IDF model based on new dataset. At
>> that time as well I will follow same training-validation split (or
>> corss-validation) to evaluate model performance on new data before
>> releasing it to make prediction. So afik , every time you  need to re-train
>> model you will need to corss validate using some data split strategy.
>>
>> I think spark ML document should start explaining mathematical model or
>> simple algorithm what Fit and Transform means for particular algorithm
>> (IDF, NaiveBayes)
>>
>> Thanks
>>
>> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <gu...@gmail.com> wrote:
>>
>>> I have come across similar situation recently and decided to run
>>> Training  workflow less frequently than scoring workflow.
>>>
>>> In your use case I would imagine you will run IDF fit workflow once in
>>> say a week. It will produce a model object which will be saved. In scoring
>>> workflow, you will typically see new unseen dataset and the model generated
>>> in training flow will be used to score or label this new dataset.
>>>
>>> Note, train and test datasets are used during development phase when you
>>> are trying to find out which model to use and efficientcy/performance/accuracy
>>> etc. It will never be part of workflow. In a little elaborate setting you
>>> may want to automate model evaluations, but that's a different story.
>>>
>>> Not sure if I could explain properly, please feel free to comment.
>>> On 1 Nov 2016 22:54, "Nirav Patel" <np...@xactlycorp.com> wrote:
>>>
>>>> Yes, I do apply NaiveBayes after IDF .
>>>>
>>>> " you can re-train (fit) on all your data before applying it to unseen
>>>> data." Did you mean I can reuse that model to Transform both training and
>>>> test data?
>>>>
>>>> Here's the process:
>>>>
>>>> Datasets:
>>>>
>>>>    1. Full sample data (labeled)
>>>>    2. Training (labeled)
>>>>    3. Test (labeled)
>>>>    4. Unseen (non-labeled)
>>>>
>>>> Here are two workflow options I see:
>>>>
>>>> Option - 1 (currently using)
>>>>
>>>>    1. Fit IDF model (idf-1) on full Sample data
>>>>    2. Apply(Transform) idf-1 on full sample data
>>>>    3. Split data set into Training and Test data
>>>>    4. Fit ML model on Training data
>>>>    5. Apply(Transform) model on Test data
>>>>    6. Apply(Transform) idf-1 on Unseen data
>>>>    7. Apply(Transform) model on Unseen data
>>>>
>>>> Option - 2
>>>>
>>>>    1. Split sample data into Training and Test data
>>>>    2. Fit IDF model (idf-1) only on training data
>>>>    3. Apply(Transform) idf-1 on training data
>>>>    4. Apply(Transform) idf-1 on test data
>>>>    5. Fit ML model on Training data
>>>>    6. Apply(Transform) model on Test data
>>>>    7. Apply(Transform) idf-1 on Unseen data
>>>>    8. Apply(Transform) model on Unseen data
>>>>
>>>> So you are suggesting Option-2 in this particular case, right?
>>>>
>>>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <ro...@xense.co.uk>
>>>> wrote:
>>>>
>>>>> Fit it on training data to evaluate the model. You can either use that
>>>>> model to apply to unseen data or you can re-train (fit) on all your data
>>>>> before applying it to unseen data.
>>>>>
>>>>> fit and transform are 2 different things: fit creates a model,
>>>>> transform applies a model to data to create transformed output. If you are
>>>>> using your training data in a subsequent step (e.g. running logistic
>>>>> regression or some other machine learning algorithm) then you need to
>>>>> transform your training data using the IDF model before passing it through
>>>>> the next step.
>>>>>
>>>>> ------------------------------------------------------------
>>>>> -------------------
>>>>> Robin East
>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>> Manning Publications Co.
>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 1 Nov 2016, at 11:18, Nirav Patel <np...@xactlycorp.com> wrote:
>>>>>
>>>>> Just to re-iterate what you said, I should fit IDF model only on
>>>>> training data and then re-use it for both test data and then later on
>>>>> unseen data to make predictions.
>>>>>
>>>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <ro...@xense.co.uk>
>>>>> wrote:
>>>>>
>>>>>> The point of setting aside a portion of your data as a test set is to
>>>>>> try and mimic applying your model to unseen data. If you fit your IDF model
>>>>>> to all your data, any evaluation you perform on your test set is likely to
>>>>>> over perform compared to ‘real’ unseen data. Effectively you would have
>>>>>> overfit your model.
>>>>>> ------------------------------------------------------------
>>>>>> -------------------
>>>>>> Robin East
>>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>>> Manning Publications Co.
>>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1 Nov 2016, at 10:15, Nirav Patel <np...@xactlycorp.com> wrote:
>>>>>>
>>>>>> FYI, I do reuse IDF model while making prediction against new
>>>>>> unlabeled data but not between training and test data while training a
>>>>>> model.
>>>>>>
>>>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <np...@xactlycorp.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I am using IDF estimator/model (TF-IDF) to convert text features
>>>>>>> into vectors. Currently, I fit IDF model on all sample data and then
>>>>>>> transform them. I read somewhere that I should split my data into training
>>>>>>> and test before fitting IDF model; Fit IDF only on training data and then
>>>>>>> use same transformer to transform training and test data.
>>>>>>> This raise more questions:
>>>>>>> 1) Why would you do that? What exactly do IDF learn during fitting
>>>>>>> process that it can reuse to transform any new dataset. Perhaps idea is to
>>>>>>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>>>>>>> 2) If not then fitting and transforming seems redundant for IDF model
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark ML - Is IDF model reusable

Posted by ayan guha <gu...@gmail.com>.
Hi

"classification algorithm will for sure need to Fit against new dataset to
produce new model" - I do not think this is correct. Maybe we are talking
semantics but AFAIU, you "train" one model using some dataset, and then use
it for scoring new datasets.

You may re-train every month, yes. And you may run cross validation once a
month (after re-training) or lower freq like once in 2-3 months to validate
model quality. Here, number of months are not important, but you must be
running cross validation and similar sort of "model evaluation" work flow
typically in lower frequency than Re-Training process.

On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel <np...@xactlycorp.com> wrote:

> Hi Ayan,
> After deployment, we might re-train it every month. That is whole
> different problem I have explored yet. classification algorithm will for
> sure need to Fit against new dataset to produce new model. Correct me if I
> am wrong but I think I will also FIt new IDF model based on new dataset. At
> that time as well I will follow same training-validation split (or
> corss-validation) to evaluate model performance on new data before
> releasing it to make prediction. So afik , every time you  need to re-train
> model you will need to corss validate using some data split strategy.
>
> I think spark ML document should start explaining mathematical model or
> simple algorithm what Fit and Transform means for particular algorithm
> (IDF, NaiveBayes)
>
> Thanks
>
> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <gu...@gmail.com> wrote:
>
>> I have come across similar situation recently and decided to run
>> Training  workflow less frequently than scoring workflow.
>>
>> In your use case I would imagine you will run IDF fit workflow once in
>> say a week. It will produce a model object which will be saved. In scoring
>> workflow, you will typically see new unseen dataset and the model generated
>> in training flow will be used to score or label this new dataset.
>>
>> Note, train and test datasets are used during development phase when you
>> are trying to find out which model to use and efficientcy/performance/accuracy
>> etc. It will never be part of workflow. In a little elaborate setting you
>> may want to automate model evaluations, but that's a different story.
>>
>> Not sure if I could explain properly, please feel free to comment.
>> On 1 Nov 2016 22:54, "Nirav Patel" <np...@xactlycorp.com> wrote:
>>
>>> Yes, I do apply NaiveBayes after IDF .
>>>
>>> " you can re-train (fit) on all your data before applying it to unseen
>>> data." Did you mean I can reuse that model to Transform both training and
>>> test data?
>>>
>>> Here's the process:
>>>
>>> Datasets:
>>>
>>>    1. Full sample data (labeled)
>>>    2. Training (labeled)
>>>    3. Test (labeled)
>>>    4. Unseen (non-labeled)
>>>
>>> Here are two workflow options I see:
>>>
>>> Option - 1 (currently using)
>>>
>>>    1. Fit IDF model (idf-1) on full Sample data
>>>    2. Apply(Transform) idf-1 on full sample data
>>>    3. Split data set into Training and Test data
>>>    4. Fit ML model on Training data
>>>    5. Apply(Transform) model on Test data
>>>    6. Apply(Transform) idf-1 on Unseen data
>>>    7. Apply(Transform) model on Unseen data
>>>
>>> Option - 2
>>>
>>>    1. Split sample data into Training and Test data
>>>    2. Fit IDF model (idf-1) only on training data
>>>    3. Apply(Transform) idf-1 on training data
>>>    4. Apply(Transform) idf-1 on test data
>>>    5. Fit ML model on Training data
>>>    6. Apply(Transform) model on Test data
>>>    7. Apply(Transform) idf-1 on Unseen data
>>>    8. Apply(Transform) model on Unseen data
>>>
>>> So you are suggesting Option-2 in this particular case, right?
>>>
>>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <ro...@xense.co.uk>
>>> wrote:
>>>
>>>> Fit it on training data to evaluate the model. You can either use that
>>>> model to apply to unseen data or you can re-train (fit) on all your data
>>>> before applying it to unseen data.
>>>>
>>>> fit and transform are 2 different things: fit creates a model,
>>>> transform applies a model to data to create transformed output. If you are
>>>> using your training data in a subsequent step (e.g. running logistic
>>>> regression or some other machine learning algorithm) then you need to
>>>> transform your training data using the IDF model before passing it through
>>>> the next step.
>>>>
>>>> ------------------------------------------------------------
>>>> -------------------
>>>> Robin East
>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>> Manning Publications Co.
>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 1 Nov 2016, at 11:18, Nirav Patel <np...@xactlycorp.com> wrote:
>>>>
>>>> Just to re-iterate what you said, I should fit IDF model only on
>>>> training data and then re-use it for both test data and then later on
>>>> unseen data to make predictions.
>>>>
>>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <ro...@xense.co.uk>
>>>> wrote:
>>>>
>>>>> The point of setting aside a portion of your data as a test set is to
>>>>> try and mimic applying your model to unseen data. If you fit your IDF model
>>>>> to all your data, any evaluation you perform on your test set is likely to
>>>>> over perform compared to ‘real’ unseen data. Effectively you would have
>>>>> overfit your model.
>>>>> ------------------------------------------------------------
>>>>> -------------------
>>>>> Robin East
>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>> Manning Publications Co.
>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 1 Nov 2016, at 10:15, Nirav Patel <np...@xactlycorp.com> wrote:
>>>>>
>>>>> FYI, I do reuse IDF model while making prediction against new
>>>>> unlabeled data but not between training and test data while training a
>>>>> model.
>>>>>
>>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <np...@xactlycorp.com>
>>>>> wrote:
>>>>>
>>>>>> I am using IDF estimator/model (TF-IDF) to convert text features into
>>>>>> vectors. Currently, I fit IDF model on all sample data and then transform
>>>>>> them. I read somewhere that I should split my data into training and test
>>>>>> before fitting IDF model; Fit IDF only on training data and then use same
>>>>>> transformer to transform training and test data.
>>>>>> This raise more questions:
>>>>>> 1) Why would you do that? What exactly do IDF learn during fitting
>>>>>> process that it can reuse to transform any new dataset. Perhaps idea is to
>>>>>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>>>>>> 2) If not then fitting and transforming seems redundant for IDF model
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
>



-- 
Best Regards,
Ayan Guha

Re: Spark ML - Is IDF model reusable

Posted by Nirav Patel <np...@xactlycorp.com>.
Hi Ayan,
After deployment, we might re-train it every month. That is a whole different
problem I haven't explored yet. The classification algorithm will for sure need
to be fit against the new dataset to produce a new model. Correct me if I am
wrong, but I think I will also fit a new IDF model based on the new dataset. At
that time I will follow the same training-validation split (or cross-validation)
to evaluate model performance on the new data before releasing it to make
predictions. So AFAIK, every time you need to re-train a model you will need
to cross-validate using some data split strategy.
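
For what it's worth, one way to get that for free is to put the IDF stage inside
a Pipeline and hand the whole pipeline to CrossValidator, so the IDF weights are
re-fit on each training fold and never computed from the held-out fold. A rough
sketch (Spark 2.x; labeledDF and the column names are placeholders):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("tf")
    val idf       = new IDF().setInputCol("tf").setOutputCol("features")
    val nb        = new NaiveBayes()   // expects "label" and "features" columns by default

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))

    val cv = new CrossValidator()
      .setEstimator(pipeline)          // the IDF fit happens inside each fold's training set
      .setEvaluator(new MulticlassClassificationEvaluator())
      .setEstimatorParamMaps(
        new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(1 << 16, 1 << 18)).build())
      .setNumFolds(3)

    val cvModel = cv.fit(labeledDF)    // cvModel.bestModel is a fitted PipelineModel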

I think the Spark ML documentation should start by explaining, with a
mathematical model or a simple algorithm, what Fit and Transform mean for each
particular algorithm (IDF, NaiveBayes).

Thanks

On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <gu...@gmail.com> wrote:

> I have come across similar situation recently and decided to run Training
> workflow less frequently than scoring workflow.
>
> In your use case I would imagine you will run IDF fit workflow once in say
> a week. It will produce a model object which will be saved. In scoring
> workflow, you will typically see new unseen dataset and the model generated
> in training flow will be used to score or label this new dataset.
>
> Note, train and test datasets are used during development phase when you
> are trying to find out which model to use and efficientcy/performance/accuracy
> etc. It will never be part of workflow. In a little elaborate setting you
> may want to automate model evaluations, but that's a different story.
>
> Not sure if I could explain properly, please feel free to comment.
> On 1 Nov 2016 22:54, "Nirav Patel" <np...@xactlycorp.com> wrote:
>
>> Yes, I do apply NaiveBayes after IDF .
>>
>> " you can re-train (fit) on all your data before applying it to unseen
>> data." Did you mean I can reuse that model to Transform both training and
>> test data?
>>
>> Here's the process:
>>
>> Datasets:
>>
>>    1. Full sample data (labeled)
>>    2. Training (labeled)
>>    3. Test (labeled)
>>    4. Unseen (non-labeled)
>>
>> Here are two workflow options I see:
>>
>> Option - 1 (currently using)
>>
>>    1. Fit IDF model (idf-1) on full Sample data
>>    2. Apply(Transform) idf-1 on full sample data
>>    3. Split data set into Training and Test data
>>    4. Fit ML model on Training data
>>    5. Apply(Transform) model on Test data
>>    6. Apply(Transform) idf-1 on Unseen data
>>    7. Apply(Transform) model on Unseen data
>>
>> Option - 2
>>
>>    1. Split sample data into Training and Test data
>>    2. Fit IDF model (idf-1) only on training data
>>    3. Apply(Transform) idf-1 on training data
>>    4. Apply(Transform) idf-1 on test data
>>    5. Fit ML model on Training data
>>    6. Apply(Transform) model on Test data
>>    7. Apply(Transform) idf-1 on Unseen data
>>    8. Apply(Transform) model on Unseen data
>>
>> So you are suggesting Option-2 in this particular case, right?
>>
>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <ro...@xense.co.uk>
>> wrote:
>>
>>> Fit it on training data to evaluate the model. You can either use that
>>> model to apply to unseen data or you can re-train (fit) on all your data
>>> before applying it to unseen data.
>>>
>>> fit and transform are 2 different things: fit creates a model, transform
>>> applies a model to data to create transformed output. If you are using your
>>> training data in a subsequent step (e.g. running logistic regression or
>>> some other machine learning algorithm) then you need to transform your
>>> training data using the IDF model before passing it through the next step.
>>>
>>> ------------------------------------------------------------
>>> -------------------
>>> Robin East
>>> *Spark GraphX in Action* Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>>
>>>
>>>
>>>
>>>
>>> On 1 Nov 2016, at 11:18, Nirav Patel <np...@xactlycorp.com> wrote:
>>>
>>> Just to re-iterate what you said, I should fit IDF model only on
>>> training data and then re-use it for both test data and then later on
>>> unseen data to make predictions.
>>>
>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <ro...@xense.co.uk>
>>> wrote:
>>>
>>>> The point of setting aside a portion of your data as a test set is to
>>>> try and mimic applying your model to unseen data. If you fit your IDF model
>>>> to all your data, any evaluation you perform on your test set is likely to
>>>> over perform compared to ‘real’ unseen data. Effectively you would have
>>>> overfit your model.
>>>> ------------------------------------------------------------
>>>> -------------------
>>>> Robin East
>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>> Manning Publications Co.
>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 1 Nov 2016, at 10:15, Nirav Patel <np...@xactlycorp.com> wrote:
>>>>
>>>> FYI, I do reuse IDF model while making prediction against new unlabeled
>>>> data but not between training and test data while training a model.
>>>>
>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <np...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> I am using IDF estimator/model (TF-IDF) to convert text features into
>>>>> vectors. Currently, I fit IDF model on all sample data and then transform
>>>>> them. I read somewhere that I should split my data into training and test
>>>>> before fitting IDF model; Fit IDF only on training data and then use same
>>>>> transformer to transform training and test data.
>>>>> This raise more questions:
>>>>> 1) Why would you do that? What exactly do IDF learn during fitting
>>>>> process that it can reuse to transform any new dataset. Perhaps idea is to
>>>>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>>>>> 2) If not then fitting and transforming seems redundant for IDF model
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>


Re: Spark ML - Is IDF model reusable

Posted by ayan guha <gu...@gmail.com>.
I have come across a similar situation recently and decided to run the training
workflow less frequently than the scoring workflow.

In your use case I would imagine you will run the IDF fit workflow once in, say,
a week. It will produce a model object which will be saved. In the scoring
workflow, you will typically see a new, unseen dataset, and the model generated
in the training flow will be used to score or label this new dataset.
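
A rough sketch of that hand-off, assuming a fitted PipelineModel (e.g. the one
produced by the cross-validation sketch earlier) and placeholder names/paths:

    import org.apache.spark.ml.PipelineModel

    // training workflow: persist the fitted pipeline (IDF weights and classifier included)
    fittedPipeline.write.overwrite().save("hdfs:///models/tfidf-nb/2016-11-01")

    // scoring workflow (a separate job): load the model and apply it to new, unseen data
    val model  = PipelineModel.load("hdfs:///models/tfidf-nb/2016-11-01")
    val scored = model.transform(unseenDF)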

Note, train and test datasets are used during the development phase, when you
are trying to find out which model to use and its
efficiency/performance/accuracy etc. They will never be part of the scoring
workflow. In a slightly more elaborate setting you may want to automate model
evaluations, but that's a different story.

Not sure if I explained it properly; please feel free to comment.
On 1 Nov 2016 22:54, "Nirav Patel" <np...@xactlycorp.com> wrote:

> Yes, I do apply NaiveBayes after IDF .
>
> " you can re-train (fit) on all your data before applying it to unseen
> data." Did you mean I can reuse that model to Transform both training and
> test data?
>
> Here's the process:
>
> Datasets:
>
>    1. Full sample data (labeled)
>    2. Training (labeled)
>    3. Test (labeled)
>    4. Unseen (non-labeled)
>
> Here are two workflow options I see:
>
> Option - 1 (currently using)
>
>    1. Fit IDF model (idf-1) on full Sample data
>    2. Apply(Transform) idf-1 on full sample data
>    3. Split data set into Training and Test data
>    4. Fit ML model on Training data
>    5. Apply(Transform) model on Test data
>    6. Apply(Transform) idf-1 on Unseen data
>    7. Apply(Transform) model on Unseen data
>
> Option - 2
>
>    1. Split sample data into Training and Test data
>    2. Fit IDF model (idf-1) only on training data
>    3. Apply(Transform) idf-1 on training data
>    4. Apply(Transform) idf-1 on test data
>    5. Fit ML model on Training data
>    6. Apply(Transform) model on Test data
>    7. Apply(Transform) idf-1 on Unseen data
>    8. Apply(Transform) model on Unseen data
>
> So you are suggesting Option-2 in this particular case, right?
>
> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <ro...@xense.co.uk> wrote:
>
>> Fit it on training data to evaluate the model. You can either use that
>> model to apply to unseen data or you can re-train (fit) on all your data
>> before applying it to unseen data.
>>
>> fit and transform are 2 different things: fit creates a model, transform
>> applies a model to data to create transformed output. If you are using your
>> training data in a subsequent step (e.g. running logistic regression or
>> some other machine learning algorithm) then you need to transform your
>> training data using the IDF model before passing it through the next step.
>>
>> ------------------------------------------------------------
>> -------------------
>> Robin East
>> *Spark GraphX in Action* Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>>
>>
>>
>>
>> On 1 Nov 2016, at 11:18, Nirav Patel <np...@xactlycorp.com> wrote:
>>
>> Just to re-iterate what you said, I should fit IDF model only on training
>> data and then re-use it for both test data and then later on unseen data to
>> make predictions.
>>
>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <ro...@xense.co.uk>
>> wrote:
>>
>>> The point of setting aside a portion of your data as a test set is to
>>> try and mimic applying your model to unseen data. If you fit your IDF model
>>> to all your data, any evaluation you perform on your test set is likely to
>>> over perform compared to ‘real’ unseen data. Effectively you would have
>>> overfit your model.
>>> ------------------------------------------------------------
>>> -------------------
>>> Robin East
>>> *Spark GraphX in Action* Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>>
>>>
>>>
>>>
>>>
>>> On 1 Nov 2016, at 10:15, Nirav Patel <np...@xactlycorp.com> wrote:
>>>
>>> FYI, I do reuse IDF model while making prediction against new unlabeled
>>> data but not between training and test data while training a model.
>>>
>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <np...@xactlycorp.com>
>>> wrote:
>>>
>>>> I am using IDF estimator/model (TF-IDF) to convert text features into
>>>> vectors. Currently, I fit IDF model on all sample data and then transform
>>>> them. I read somewhere that I should split my data into training and test
>>>> before fitting IDF model; Fit IDF only on training data and then use same
>>>> transformer to transform training and test data.
>>>> This raise more questions:
>>>> 1) Why would you do that? What exactly do IDF learn during fitting
>>>> process that it can reuse to transform any new dataset. Perhaps idea is to
>>>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>>>> 2) If not then fitting and transforming seems redundant for IDF model
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>

Re: Spark ML - Is IDF model reusable

Posted by Nirav Patel <np...@xactlycorp.com>.
Yes, I do apply NaiveBayes after IDF.

"you can re-train (fit) on all your data before applying it to unseen
data." Did you mean I can reuse that model to Transform both training and
test data?

Here's the process:

Datasets:

   1. Full sample data (labeled)
   2. Training (labeled)
   3. Test (labeled)
   4. Unseen (non-labeled)

Here are two workflow options I see:

Option - 1 (currently using)

   1. Fit IDF model (idf-1) on full Sample data
   2. Apply(Transform) idf-1 on full sample data
   3. Split data set into Training and Test data
   4. Fit ML model on Training data
   5. Apply(Transform) model on Test data
   6. Apply(Transform) idf-1 on Unseen data
   7. Apply(Transform) model on Unseen data

Option - 2

   1. Split sample data into Training and Test data
   2. Fit IDF model (idf-1) only on training data
   3. Apply(Transform) idf-1 on training data
   4. Apply(Transform) idf-1 on test data
   5. Fit ML model on Training data
   6. Apply(Transform) model on Test data
   7. Apply(Transform) idf-1 on Unseen data
   8. Apply(Transform) model on Unseen data

So you are suggesting Option-2 in this particular case, right?
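
A rough sketch of Option-2 with the spark.ml API (sampleDF/unseenDF, the column
names, and the 80/20 split are placeholders; the labeled data is assumed to carry
a "label" column for NaiveBayes):

    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
    import org.apache.spark.sql.DataFrame

    def toTf(df: DataFrame): DataFrame =
      new HashingTF().setInputCol("words").setOutputCol("tf")
        .transform(new Tokenizer().setInputCol("text").setOutputCol("words").transform(df))

    // 1. split the labeled sample data into training and test sets
    val Array(training, test) = sampleDF.randomSplit(Array(0.8, 0.2), seed = 42)

    // 2. fit the IDF model (idf-1) only on training data
    val idfModel = new IDF().setInputCol("tf").setOutputCol("features").fit(toTf(training))

    // 3-4. transform training and test data with the same idf-1
    val trainFeats = idfModel.transform(toTf(training))
    val testFeats  = idfModel.transform(toTf(test))

    // 5-6. fit the classifier on training features, evaluate it on test features
    val nbModel    = new NaiveBayes().fit(trainFeats)
    val testScored = nbModel.transform(testFeats)

    // 7-8. apply idf-1 and then the classifier to unseen (unlabeled) data
    val unseenScored = nbModel.transform(idfModel.transform(toTf(unseenDF)))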

On Tue, Nov 1, 2016 at 4:24 AM, Robin East <ro...@xense.co.uk> wrote:

> Fit it on training data to evaluate the model. You can either use that
> model to apply to unseen data or you can re-train (fit) on all your data
> before applying it to unseen data.
>
> fit and transform are 2 different things: fit creates a model, transform
> applies a model to data to create transformed output. If you are using your
> training data in a subsequent step (e.g. running logistic regression or
> some other machine learning algorithm) then you need to transform your
> training data using the IDF model before passing it through the next step.
>
> ------------------------------------------------------------
> -------------------
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 1 Nov 2016, at 11:18, Nirav Patel <np...@xactlycorp.com> wrote:
>
> Just to re-iterate what you said, I should fit IDF model only on training
> data and then re-use it for both test data and then later on unseen data to
> make predictions.
>
> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <ro...@xense.co.uk> wrote:
>
>> The point of setting aside a portion of your data as a test set is to try
>> and mimic applying your model to unseen data. If you fit your IDF model to
>> all your data, any evaluation you perform on your test set is likely to
>> over perform compared to ‘real’ unseen data. Effectively you would have
>> overfit your model.
>> ------------------------------------------------------------
>> -------------------
>> Robin East
>> *Spark GraphX in Action* Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>>
>>
>>
>>
>> On 1 Nov 2016, at 10:15, Nirav Patel <np...@xactlycorp.com> wrote:
>>
>> FYI, I do reuse IDF model while making prediction against new unlabeled
>> data but not between training and test data while training a model.
>>
>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <np...@xactlycorp.com>
>> wrote:
>>
>>> I am using IDF estimator/model (TF-IDF) to convert text features into
>>> vectors. Currently, I fit IDF model on all sample data and then transform
>>> them. I read somewhere that I should split my data into training and test
>>> before fitting IDF model; Fit IDF only on training data and then use same
>>> transformer to transform training and test data.
>>> This raise more questions:
>>> 1) Why would you do that? What exactly do IDF learn during fitting
>>> process that it can reuse to transform any new dataset. Perhaps idea is to
>>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>>> 2) If not then fitting and transforming seems redundant for IDF model
>>>

Re: Spark ML - Is IDF model reusable

Posted by Robin East <ro...@xense.co.uk>.
Fit it on training data to evaluate the model. You can either apply that model to unseen data, or you can re-train (fit) on all your data before applying it to unseen data.

fit and transform are 2 different things: fit creates a model, transform applies a model to data to create transformed output. If you are using your training data in a subsequent step (e.g. running logistic regression or some other machine learning algorithm) then you need to transform your training data using the IDF model before passing it through the next step.
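
To make the distinction concrete, a minimal, untested sketch (it assumes a term-frequency column called "rawFeatures" already exists on trainingTF and testTF):

import org.apache.spark.ml.feature.IDF

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

// fit: the Estimator learns from data and returns a Model - here an IDFModel
// holding the IDF weights derived from document frequencies in trainingTF
val idfModel = idf.fit(trainingTF)

// transform: the Model applies what it learned to any DataFrame with the same
// input column, adding a "features" column with the TF-IDF vectors
val trainingFeatures = idfModel.transform(trainingTF) // feed this to the next step
val testFeatures     = idfModel.transform(testTF)     // same model, different data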

-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 1 Nov 2016, at 11:18, Nirav Patel <np...@xactlycorp.com> wrote:
> 
> Just to re-iterate what you said, I should fit IDF model only on training data and then re-use it for both test data and then later on unseen data to make predictions.
> 
> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.east@xense.co.uk <ma...@xense.co.uk>> wrote:
> The point of setting aside a portion of your data as a test set is to try and mimic applying your model to unseen data. If you fit your IDF model to all your data, any evaluation you perform on your test set is likely to over perform compared to ‘real’ unseen data. Effectively you would have overfit your model.
> -------------------------------------------------------------------------------
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action <http://www.manning.com/books/spark-graphx-in-action>
> 
> 
> 
> 
> 
>> On 1 Nov 2016, at 10:15, Nirav Patel <npatel@xactlycorp.com <ma...@xactlycorp.com>> wrote:
>> 
>> FYI, I do reuse IDF model while making prediction against new unlabeled data but not between training and test data while training a model. 
>> 
>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npatel@xactlycorp.com <ma...@xactlycorp.com>> wrote:
>> I am using IDF estimator/model (TF-IDF) to convert text features into vectors. Currently, I fit IDF model on all sample data and then transform them. I read somewhere that I should split my data into training and test before fitting IDF model; Fit IDF only on training data and then use same transformer to transform training and test data. 
>> This raise more questions:
>> 1) Why would you do that? What exactly do IDF learn during fitting process that it can reuse to transform any new dataset. Perhaps idea is to keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>> 2) If not then fitting and transforming seems redundant for IDF model
>> 

Re: Spark ML - Is IDF model reusable

Posted by Nirav Patel <np...@xactlycorp.com>.
Just to re-iterate what you said: I should fit the IDF model only on
training data and then re-use it both for the test data and, later, for
unseen data when making predictions.
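
In code I picture something like the following (just a sketch; the save/load
step is only there to show that the same fitted IDFModel gets re-used later
on unseen data, and the path is made up):

import org.apache.spark.ml.feature.IDFModel

// idfModel was produced during training by idf.fit(trainingTF); persist it
idfModel.write.overwrite().save("/models/idf-1")

// ... later, when scoring unseen (non-labeled) data ...
val idf1 = IDFModel.load("/models/idf-1")
val unseenFeatures = idf1.transform(unseenTF) // same |D| and DF(t, D) as at training time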

On Tue, Nov 1, 2016 at 3:49 AM, Robin East <ro...@xense.co.uk> wrote:

> The point of setting aside a portion of your data as a test set is to try
> and mimic applying your model to unseen data. If you fit your IDF model to
> all your data, any evaluation you perform on your test set is likely to
> over perform compared to ‘real’ unseen data. Effectively you would have
> overfit your model.
> ------------------------------------------------------------
> -------------------
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 1 Nov 2016, at 10:15, Nirav Patel <np...@xactlycorp.com> wrote:
>
> FYI, I do reuse IDF model while making prediction against new unlabeled
> data but not between training and test data while training a model.
>
> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <np...@xactlycorp.com> wrote:
>
>> I am using IDF estimator/model (TF-IDF) to convert text features into
>> vectors. Currently, I fit IDF model on all sample data and then transform
>> them. I read somewhere that I should split my data into training and test
>> before fitting IDF model; Fit IDF only on training data and then use same
>> transformer to transform training and test data.
>> This raise more questions:
>> 1) Why would you do that? What exactly do IDF learn during fitting
>> process that it can reuse to transform any new dataset. Perhaps idea is to
>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>> 2) If not then fitting and transforming seems redundant for IDF model
>>

Re: Spark ML - Is IDF model reusable

Posted by Robin East <ro...@xense.co.uk>.
The point of setting aside a portion of your data as a test set is to try and mimic applying your model to unseen data. If you fit your IDF model to all your data, any evaluation you perform on your test set is likely to over-perform compared to ‘real’ unseen data. Effectively you would have overfit your model.
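
One way to guarantee that in practice is to put the whole chain in a Pipeline and fit it only on the training split. A rough, untested sketch (column names, split ratio and the evaluator metric are illustrative):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Hold out the test set *before* any fitting so it really does mimic unseen data
val Array(training, test) = labeled.randomSplit(Array(0.8, 0.2), seed = 11L)

val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("rawFeatures"),
  new IDF().setInputCol("rawFeatures").setOutputCol("features"),
  new NaiveBayes().setFeaturesCol("features").setLabelCol("label")))

// fit() only ever sees the training split, so the IDF statistics never see the test set
val model = pipeline.fit(training)

// Evaluating on the held-out split is now an honest estimate of unseen-data performance
val f1 = new MulticlassClassificationEvaluator()
  .setLabelCol("label").setPredictionCol("prediction").setMetricName("f1")
  .evaluate(model.transform(test))
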
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 1 Nov 2016, at 10:15, Nirav Patel <np...@xactlycorp.com> wrote:
> 
> FYI, I do reuse IDF model while making prediction against new unlabeled data but not between training and test data while training a model. 
> 
> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npatel@xactlycorp.com <ma...@xactlycorp.com>> wrote:
> I am using IDF estimator/model (TF-IDF) to convert text features into vectors. Currently, I fit IDF model on all sample data and then transform them. I read somewhere that I should split my data into training and test before fitting IDF model; Fit IDF only on training data and then use same transformer to transform training and test data. 
> This raise more questions:
> 1) Why would you do that? What exactly do IDF learn during fitting process that it can reuse to transform any new dataset. Perhaps idea is to keep same value for |D| and DF|t, D| while use new TF|t, D| ?
> 2) If not then fitting and transforming seems redundant for IDF model
> 