Posted to users@opennlp.apache.org by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> on 2016/02/25 17:06:06 UTC

classification with extra information

Hi,
  Is it possible to change the prior based on a feature?

For example, if I have the following data (very simplified):

Class, Predicates

A, X
A, X
B, X

You would expect class A 2/3 of the time when the feature is just predicate X.

However, let's say I know of another feature Y that can take values {Q,R,S}, with P(A|Q)=0.8, P(A|R)=0.1, P(A|S)=0.3.

Is there any way to add feature Y to the classifier taking advantage of this information?
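To make the question concrete, the kind of combination I am imagining looks roughly like the sketch below (plain Java, invented names, nothing OpenNLP-specific): take the model's probability for each class and rescale it by the known P(class|Y), then renormalize.

    // Sketch only: rescale a model's per-class probabilities by a known
    // prior P(class | Y) and renormalize. All names here are made up.
    import java.util.HashMap;
    import java.util.Map;

    public class PriorReweight {

        static Map<String, Double> reweight(Map<String, Double> modelProbs,
                                            Map<String, Double> priorGivenY) {
            Map<String, Double> out = new HashMap<>();
            double z = 0.0;
            for (Map.Entry<String, Double> e : modelProbs.entrySet()) {
                double w = e.getValue() * priorGivenY.getOrDefault(e.getKey(), 1.0);
                out.put(e.getKey(), w);
                z += w;
            }
            for (Map.Entry<String, Double> e : out.entrySet()) {
                e.setValue(e.getValue() / z);   // renormalize so the scores sum to 1
            }
            return out;
        }

        public static void main(String[] args) {
            // From the toy data: P(A|X)=2/3, P(B|X)=1/3; suppose Y=Q with P(A|Q)=0.8, P(B|Q)=0.2.
            Map<String, Double> model = Map.of("A", 2.0 / 3.0, "B", 1.0 / 3.0);
            Map<String, Double> prior = Map.of("A", 0.8, "B", 0.2);
            System.out.println(reweight(model, prior));   // roughly {A=0.89, B=0.11}
        }
    }

With the toy numbers above, knowing Y=Q would push class A from 2/3 up to about 0.89.
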
Thanks
Dan



Re: classification with extra information

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
Sorry, there was a typo. It should read:

The abstraction is getting difficult, so let me get a little more specific. Y is an industry code, and there are many of them. For each data row (which obviously has more than just one predicate) I have an industry code. My original thought was that I could have a prior based on the industry.

I could have data like:

> A,solvent,dust,code=111222
> A,insecticide,code=111312
> …
> B,solvent,diesel,code=111222
> ...
> 
> The problem is that I am using the industry distribution from my training set, not the census.
> 
> By the “best value” I mean that when classifying an example the model has not seen before, I would like the model to classify based on the prior.  If p(A|Y)=0.8, select A with p=0.8.
> 
> Dan
> 
> On Feb 25, 2016, at 12:02 PM, Nishant Kelkar <ni...@gmail.com> wrote:
> 
> I guess I don't quite understand then. So your training data is small, but
> you have a potentially high cardinality feature Y from a separate source
> (US Census)...how are you marrying them together then? As in, how does each
> row in your small training set get a Y? Is, for example, X, a common column
> between the two sets, where X --> Y is a one-to-many mapping?
> 
> As far as using the information provided by Y, I think any model that
> estimates a joint probability P(Y, X, label) will inadvertently end up
> using information about P(label | Y), no?
> 
> Also, what does the last line in your previous email mean ("If possible
> I would like to use the best values available.")?
> 
> Best,
> Nishant
> 
> On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <
> druss@mail.nih.gov> wrote:
> 
> Yes, but my training data is a small biased sample whereas feature “Y” are
> population values (actually taken from the US Census, so a very large
> sample).  If possible I would like to use the best values available.
> 
> 
> Daniel Russ, Ph.D.
> Staff Scientist, Division of Computational Bioscience
> Center for Information Technology
> National Institutes of Health
> U.S. Department of Health and Human Services
> 12 South Drive
> Bethesda,  MD 20892-5624
> 
> On Feb 25, 2016, at 11:29 AM, Nishant Kelkar <ni...@gmail.com> wrote:
> 
> Hi Dan,
> 
> Can't you call (A, Q) as A', (A,R) as A'', and so on...and just treat them
> as separate labels altogether? Your classifier can then learn using these
> "fake" labels.
> 
> You can then have an in memory map of what each fake label (A'' for
> example) corresponds to in reality (A'' in this case = (A, R)).
> 
> Best Regards,
> Nishant Kelkar
> 
> On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <
> druss@mail.nih.gov> wrote:
> 
> I am not sure I understand.  When I think of the kernel trick, I think of
> converting a linear decision boundary into a higher order decision
> boundary.  (i.e. r<-x^2 + y^2 giving a circular decision boundary).  Maybe
> I am missing something?  I’ll look into this a bit more.
> Dan
> 
> 
> On Feb 25, 2016, at 11:11 AM, Alexander Wallin <
> alexander@wallindevelopment.se> wrote:
> 
> Can’t you make a compounded feature (or features), i.e. use the kernel
> trick?
> 
> Alexander
> 
> On 25 Feb 2016, at 17:06, Russ, Daniel (NIH/CIT) [E] <druss@mail.nih.gov> wrote:
> 
> Hi,
> Is it possible to change the prior based on a feature?
> 
> For example, if I have the following data (very simplified):
> 
> Class, Predicates
> 
> A, X
> A, X
> B, X
> 
> You would expect class A 2/3 of the time when the feature is just
> predicate X.
> 
> However, let's say I know of another feature Y that can take values
> {Q,R,S}, with P(A|Q)=0.8, P(A|R)=0.1, P(A|S)=0.3.
> 
> Is there any way to add feature Y to the classifier taking advantage of
> this information?
> Thanks
> Dan
> 
> 
> 
> 
> 
> 
> 
> 


Re: classification with extra information

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
The abstraction is getting difficult, so let me get a little more specific. Y is an industry code, and there are many of them. For each data row (which obviously has more than just one predicate) I have an industry code. My original thought was that I could have a prior based on the industry. I could have data like:

A,solvent,dust,code=111222
A,insecticide,code=111312
…
B,solvent,diesel,code=111222
...

The problem is that I am using the industry distribution from my training set, not the census.

By the “best value” I mean that when classifying an example the model has not seen before, I would like the model to classify based on the prior.  If p(A|Y)=0.8, select A with p=0.8.
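
Concretely, the fallback behaviour I would like looks something like this sketch (plain Java, invented names, nothing OpenNLP-specific): if none of an example's predicates were seen in training, pick the class from the census prior for its industry code; otherwise let the trained model decide.

    // Hypothetical back-off: for an example whose predicates were never seen
    // in training, draw the class from the census prior P(class | code);
    // otherwise defer to the trained model. Names are made up.
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;
    import java.util.function.Function;

    public class CensusBackoff {
        private final Set<String> knownPredicates;                   // predicates seen in training
        private final Map<String, Map<String, Double>> censusPrior;  // code -> (class -> P(class | code))
        private final Function<String[], String> trainedModel;       // whatever classifier was trained on the rows
        private final Random rng = new Random();

        CensusBackoff(Set<String> knownPredicates,
                      Map<String, Map<String, Double>> censusPrior,
                      Function<String[], String> trainedModel) {
            this.knownPredicates = knownPredicates;
            this.censusPrior = censusPrior;
            this.trainedModel = trainedModel;
        }

        String classify(String[] predicates, String industryCode) {
            for (String p : predicates) {
                if (knownPredicates.contains(p)) {
                    return trainedModel.apply(predicates);   // the model has something to go on
                }
            }
            // Nothing recognized: pick A with probability 0.8 if P(A | code) = 0.8, etc.
            double u = rng.nextDouble(), cum = 0.0;
            String last = null;
            for (Map.Entry<String, Double> e : censusPrior.get(industryCode).entrySet()) {
                cum += e.getValue();
                last = e.getKey();
                if (u < cum) {
                    return last;
                }
            }
            return last;
        }
    }

That way an industry code the model never learned anything useful about still contributes the census numbers rather than the skewed frequencies from my training sample.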

Dan

On Feb 25, 2016, at 12:02 PM, Nishant Kelkar <ni...@gmail.com> wrote:

I guess I don't quite understand then. So your training data is small, but
you have a potentially high cardinality feature Y from a separate source
(US Census)...how are you marrying them together then? As in, how does each
row in your small training set get a Y? Is, for example, X, a common column
between the two sets, where X --> Y is a one-to-many mapping?

As far as using the information provided by Y, I think any model that
estimates a joint probability P(Y, X, label) will inadvertently end up
using information about P(label | Y), no?

Also, what does the last line in your previous email mean ("If possible
I would like to use the best values available.")?

Best,
Nishant

On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <
druss@mail.nih.gov> wrote:

Yes, but my training data is a small biased sample whereas feature “Y” are
population values (actually taken from the US Census, so a very large
sample).  If possible I would like to use the best values available.


Daniel Russ, Ph.D.
Staff Scientist, Division of Computational Bioscience
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda,  MD 20892-5624

On Feb 25, 2016, at 11:29 AM, Nishant Kelkar <ni...@gmail.com> wrote:

Hi Dan,

Can't you call (A, Q) as A', (A,R) as A'', and so on...and just treat them
as separate labels altogether? Your classifier can then learn using these
"fake" labels.

You can then have an in memory map of what each fake label (A'' for
example) corresponds to in reality (A'' in this case = (A, R)).

Best Regards,
Nishant Kelkar

On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <
druss@mail.nih.gov> wrote:

I am not sure I understand.  When I think of the kernel trick, I think of
converting a linear decision boundary into a higher order decision
boundary.  (i.e. r<-x^2 + y^2 giving a circular decision boundary).  Maybe
I am missing something?  I’ll look into this a bit more.
Dan


On Feb 25, 2016, at 11:11 AM, Alexander Wallin <
alexander@wallindevelopment.se> wrote:

Can’t you make a compounded feature (or features), i.e. use the kernel
trick?

Alexander

On 25 Feb 2016, at 17:06, Russ, Daniel (NIH/CIT) [E] <druss@mail.nih.gov> wrote:

Hi,
Is it possible to change the prior based on a feature?

For example, if I have the following data (very simplified):

Class, Predicates

A, X
A, X
B, X

You would expect class A 2/3 of the time when the feature is just
predicate X.

However, let's say I know of another feature Y that can take values
{Q,R,S}, with P(A|Q)=0.8, P(A|R)=0.1, P(A|S)=0.3.

Is there any way to add feature Y to the classifier taking advantage of
this information?
Thanks
Dan









Re: classification with extra information

Posted by Nishant Kelkar <ni...@gmail.com>.
I guess I don't quite understand then. So your training data is small, but
you have a potentially high cardinality feature Y from a separate source
(US Census)...how are you marrying them together then? As in, how does each
row in your small training set get a Y? Is, for example, X, a common column
between the two sets, where X --> Y is a one-to-many mapping?

As far as using the information provided by Y, I think any model that
estimates a joint probability P(Y, X, label) will inadvertently end up
using information about P(label | Y), no?
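
To spell that out: whatever joint the model fits, it already implies a label-given-Y distribution, since summing the joint over X gives (roughly)

    P(label | Y) = sum_x P(x, label | Y)

The only catch is that this implied distribution is estimated from your training rows.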

Also, what does the last line in your previous email mean ("If possible
I would like to use the best values available.")?

Best,
Nishant

On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <
druss@mail.nih.gov> wrote:

> Yes, but my training data is a small biased sample whereas feature “Y” are
> population values (actually taken from the US Census, so a very large
> sample).  If possible I would like to use the best values available.
>
>
> Daniel Russ, Ph.D.
> Staff Scientist, Division of Computational Bioscience
> Center for Information Technology
> National Institutes of Health
> U.S. Department of Health and Human Services
> 12 South Drive
> Bethesda,  MD 20892-5624
>
> On Feb 25, 2016, at 11:29 AM, Nishant Kelkar <nishant.k02@gmail.com> wrote:
>
> Hi Dan,
>
> Can't you call (A, Q) as A', (A,R) as A'', and so on...and just treat them
> as separate labels altogether? Your classifier can then learn using these
> "fake" labels.
>
> You can then have an in memory map of what each fake label (A'' for
> example) corresponds to in reality (A'' in this case = (A, R)).
>
> Best Regards,
> Nishant Kelkar
>
> On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <
> druss@mail.nih.gov> wrote:
>
> I am not sure I understand.  When I think of the kernel trick, I think of
> converting a linear decision boundary into a higher order decision
> boundary.  (i.e. r<-x^2 + y^2 giving a circular decision boundary).  Maybe
> I am missing something?  I’ll look into this a bit more.
> Dan
>
>
> On Feb 25, 2016, at 11:11 AM, Alexander Wallin <
> alexander@wallindevelopment.se> wrote:
>
> Can’t you make a compounded feature (or features), i.e. use the kernel
> trick?
>
> Alexander
>
> On 25 Feb 2016, at 17:06, Russ, Daniel (NIH/CIT) [E] <druss@mail.nih.gov> wrote:
>
> Hi,
> Is it possible to change the prior based on a feature?
>
> For example, if I have the following data (very simplified):
>
> Class, Predicates
>
> A, X
> A, X
> B, X
>
> You would expect class A 2/3 of the time when the feature is just
> predicate X.
>
> However, let's say I know of another feature Y that can take values
> {Q,R,S}, with P(A|Q)=0.8, P(A|R)=0.1, P(A|S)=0.3.
>
> Is there any way to add feature Y to the classifier taking advantage of
> this information?
> Thanks
> Dan
>
>
>
>
>
>
>

Re: classification with extra information

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
Yes, but my training data is a small, biased sample, whereas the values for feature “Y” are population values (actually taken from the US Census, so a very large sample).  If possible I would like to use the best values available.


Daniel Russ, Ph.D.
Staff Scientist, Division of Computational Bioscience
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda,  MD 20892-5624

On Feb 25, 2016, at 11:29 AM, Nishant Kelkar <ni...@gmail.com> wrote:

Hi Dan,

Can't you call (A, Q) as A', (A,R) as A'', and so on...and just treat them
as separate labels altogether? Your classifier can then learn using these
"fake" labels.

You can then have an in memory map of what each fake label (A'' for
example) corresponds to in reality (A'' in this case = (A, R)).

Best Regards,
Nishant Kelkar

On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <
druss@mail.nih.gov> wrote:

I am not sure I understand.  When I think of the kernel trick, I think of
converting a linear decision boundary into a higher order decision
boundary.  (i.e. r<-x^2 + y^2 giving a circular decision boundary).  Maybe
I am missing something?  I’ll look into this a bit more.
Dan


On Feb 25, 2016, at 11:11 AM, Alexander Wallin <
alexander@wallindevelopment.se> wrote:

Can’t you make a compounded feature (or features), i.e. use the kernel
trick?

Alexander

On 25 Feb 2016, at 17:06, Russ, Daniel (NIH/CIT) [E] <druss@mail.nih.gov> wrote:

Hi,
Is it possible to change the prior based on a feature?

For example, if I have the following data (very simplified):

Class, Predicates

A, X
A, X
B, X

You would expect class A 2/3 of the time when the feature is just
predicate X.

However, let's say I know of another feature Y that can take values
{Q,R,S}, with P(A|Q)=0.8, P(A|R)=0.1, P(A|S)=0.3.

Is there any way to add feature Y to the classifier taking advantage of
this information?
Thanks
Dan







Re: classification with extra information

Posted by Nishant Kelkar <ni...@gmail.com>.
Hi Dan,

Can't you call (A, Q) as A', (A,R) as A'', and so on...and just treat them
as separate labels altogether? Your classifier can then learn using these
"fake" labels.

You can then have an in memory map of what each fake label (A'' for
example) corresponds to in reality (A'' in this case = (A, R)).
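
A tiny sketch of the bookkeeping I have in mind (plain Java, made-up names, nothing OpenNLP-specific):

    // Sketch: build composite "fake" labels like "A|Q" for training, and map
    // a predicted fake label back to the real (class, Y) pair afterwards.
    import java.util.AbstractMap.SimpleEntry;
    import java.util.Map;

    public class FakeLabels {

        // ("A", "Q") -> "A|Q"
        static String toFakeLabel(String realClass, String y) {
            return realClass + "|" + y;
        }

        // "A|Q" -> ("A", "Q")
        static Map.Entry<String, String> fromFakeLabel(String fake) {
            String[] parts = fake.split("\\|", 2);
            return new SimpleEntry<>(parts[0], parts[1]);
        }

        public static void main(String[] args) {
            String fake = toFakeLabel("A", "R");
            System.out.println(fake);                            // A|R
            System.out.println(fromFakeLabel(fake).getKey());    // A
        }
    }

You train on the composite labels and split them back apart when you read off predictions.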

Best Regards,
Nishant Kelkar

On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <
druss@mail.nih.gov> wrote:

> I am not sure I understand.  When I think of the kernel trick, I think of
> converting a linear decision boundary into a higher order decision
> boundary.  (i.e. r<-x^2 + y^2 giving a circular decision boundary).  Maybe
> I am missing something?  I’ll look into this a bit more.
> Dan
>
>
> > On Feb 25, 2016, at 11:11 AM, Alexander Wallin <
> alexander@wallindevelopment.se> wrote:
> >
> > Can’t you make a compounded feature (or features), i.e. use the kernel
> trick?
> >
> > Alexander
> >
> >> On 25 Feb 2016, at 17:06, Russ, Daniel (NIH/CIT) [E] <druss@mail.nih.gov> wrote:
> >>
> >> Hi,
> >> Is it possible to change the prior based on a feature?
> >>
> >> For example, if I have the following data (very simplified):
> >>
> >> Class, Predicates
> >>
> >> A, X
> >> A, X
> >> B, X
> >>
> >> You would expect class A 2/3 of the time when the feature is just
> predicate X.
> >>
> >> However, let's say I know of another feature Y that can take values
> {Q,R,S}, with P(A|Q)=0.8, P(A|R)=0.1, P(A|S)=0.3.
> >>
> >> Is there any way to add feature Y to the classifier taking advantage of
> this information?
> >> Thanks
> >> Dan
> >>
> >>
> >
>
>

Re: classification with extra information

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
I am not sure I understand.  When I think of the kernel trick, I think of converting a linear decision boundary into a higher-order decision boundary (e.g. r <- x^2 + y^2, giving a circular decision boundary).  Maybe I am missing something?  I’ll look into this a bit more.
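(For what it is worth, the picture in my head is just an explicit feature map, something like the toy sketch below; the names are mine, not from any library.)

    // Toy version of my mental model: append r = x^2 + y^2 as a feature, so a
    // linear threshold on r corresponds to a circular boundary in (x, y).
    public class RadiusFeature {

        static double[] map(double x, double y) {
            return new double[] { x, y, x * x + y * y };
        }

        public static void main(String[] args) {
            double[] f = map(1.0, 2.0);
            System.out.println(f[2]);   // 5.0; e.g. "r < 9" marks the inside of a circle of radius 3
        }
    }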
Dan


> On Feb 25, 2016, at 11:11 AM, Alexander Wallin <al...@wallindevelopment.se> wrote:
> 
> Can’t you make a compounded feature (or features), i.e. use the kernel trick?
> 
> Alexander
> 
>> On 25 Feb 2016, at 17:06, Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov> wrote:
>> 
>> Hi,
>> Is it possible to change the prior based on a feature?
>> 
>> For example, if I have the following data (very simplified):
>> 
>> Class, Predicates
>> 
>> A, X
>> A, X
>> B, X
>> 
>> You would expect class A 2/3 of the time when the feature is just predicate X.
>> 
>> However, let's say I know of another feature Y that can take values {Q,R,S}, with P(A|Q)=0.8, P(A|R)=0.1, P(A|S)=0.3.
>> 
>> Is there any way to add feature Y to the classifier taking advantage of this information?
>> Thanks
>> Dan
>> 
>> 
> 


Re: classification with extra information

Posted by Alexander Wallin <al...@wallindevelopment.se>.
Can’t you make a compound feature (or features), i.e. use the kernel trick?
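
For example (just a sketch with made-up names, not OpenNLP code): keep the original predicates and also emit each one conjoined with the value of Y, so the learner can put weight on the combinations directly.

    // Sketch: compound (conjunction) features, e.g. predicate "X" with Y=Q
    // becomes an extra predicate "X&Y=Q". Names are made up.
    import java.util.ArrayList;
    import java.util.List;

    public class CompoundFeatures {

        static List<String> expand(List<String> predicates, String yValue) {
            List<String> out = new ArrayList<>(predicates);   // keep the originals
            for (String p : predicates) {
                out.add(p + "&Y=" + yValue);                  // conjunction of predicate and Y
            }
            out.add("Y=" + yValue);                           // and Y on its own
            return out;
        }

        public static void main(String[] args) {
            System.out.println(expand(List.of("X"), "Q"));    // [X, X&Y=Q, Y=Q]
        }
    }

With the example in your mail, "X&Y=Q" and "X&Y=R" then become distinct predicates the model can weight separately.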

Alexander

> On 25 Feb 2016, at 17:06, Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov> wrote:
> 
> Hi,
>  Is it possible to change the prior based on a feature?
> 
> For example, if I have the following data (very simplified):
> 
> Class, Predicates
> 
> A, X
> A, X
> B, X
> 
> You would expect class A 2/3 of the time when the feature is just predicate X.
> 
> However, let's say I know of another feature Y that can take values {Q,R,S}, with P(A|Q)=0.8, P(A|R)=0.1, P(A|S)=0.3.
> 
> Is there any way to add feature Y to the classifier taking advantage of this information?
> Thanks
> Dan
> 
>