Posted to users@opennlp.apache.org by Daniel Frank <da...@trendrr.com> on 2011/05/06 23:00:48 UTC

Text Categorization With Only One Category

OK, so this question isn't necessarily directly related to OpenNLP usage,
but it may be something worth picking your brains over.

We currently employ OpenNLP for a number of categorization applications,
almost always with two categories. Often this is to determine whether a
document does or does not have the property X. Now, there are several more
applications we have in mind for which we can easily determine whether a
document has the property X, but not whether it *doesn't* have that
property. Let me give an example: let's say I was trying to train a
classifier that could detect sarcasm in tweets. Twitter users will sometimes
add #sarcasm to a sarcastic tweet, and sometimes not. Thus, we could easily
obtain part of a training set by simply collecting tweets with the #sarcasm
hashtag. However, we could not automatically gather tweets that definitely
did not contain sarcasm.

Has anyone got any thoughts about how one might train a model with only
training data for one category? I'm fairly certain that plugging it into the
OpenNLP maxent classifier wouldn't produce any sensible results. Perhaps
something to do with clustering? Or some variant of SVM where we could try
to determine distance from a perceived 'center of mass' of a training set?
Very curious to hear the group's thoughts; let me know if anything occurs to
you guys. Cheers,

Dan

PS - I'm aware that there is previous
work <http://staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf> on
the sarcasm question - I was just using it as an example
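
To make the 'center of mass' idea a bit more concrete, here is a rough
sketch in plain Java of what I have in mind -- just bag-of-words term
frequencies and cosine similarity against the centroid of the positive
set. The class name and the toy data are made up, and any cutoff on the
score would have to be tuned on held-out positive examples:

    import java.util.*;

    // Sketch: score a document by cosine similarity to the centroid of the
    // positive (e.g. #sarcasm) training set; a low score suggests "probably not X".
    public class CentroidScorer {

        private final Map<String, Double> centroid = new HashMap<>();

        // Build the centroid as the average bag-of-words vector of the positive docs.
        public CentroidScorer(List<String> positiveDocs) {
            for (String doc : positiveDocs) {
                for (String tok : doc.toLowerCase().split("\\s+")) {
                    centroid.merge(tok, 1.0, Double::sum);
                }
            }
            int n = positiveDocs.size();
            centroid.replaceAll((k, v) -> v / n);
        }

        // Cosine similarity between a new document and the centroid.
        public double score(String doc) {
            Map<String, Double> vec = new HashMap<>();
            for (String tok : doc.toLowerCase().split("\\s+")) {
                vec.merge(tok, 1.0, Double::sum);
            }
            double dot = 0, normDoc = 0, normCentroid = 0;
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                dot += e.getValue() * centroid.getOrDefault(e.getKey(), 0.0);
                normDoc += e.getValue() * e.getValue();
            }
            for (double v : centroid.values()) {
                normCentroid += v * v;
            }
            return (dot == 0) ? 0.0 : dot / (Math.sqrt(normDoc) * Math.sqrt(normCentroid));
        }

        public static void main(String[] args) {
            CentroidScorer scorer = new CentroidScorer(Arrays.asList(
                    "oh great another monday #sarcasm",
                    "i just love being stuck in traffic #sarcasm"));
            // Higher = closer to the positive set; a real cutoff would be tuned.
            System.out.println(scorer.score("love being stuck in traffic"));
            System.out.println(scorer.score("the cat sat on the mat"));
        }
    }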

Re: Text Categorization With Only One Category

Posted by Jason Baldridge <ja...@gmail.com>.
You definitely cannot give it just one category, so you'll need to come up
with examples that are likely not to have that property. In the case of
something like sarcasm, you might guess that there is a fairly low rate of
sarcasm in the general population of tweets, so you can just grab a bunch of
non-#sarcasm tweets and call them the non-sarcastic ones. It's okay if some
of them are actually sarcastic -- as long as you have a good number of
#sarcasm tweets, that will just be in the noise. (They are all noisy
labels, actually.)

Other fancier things could be done, but I'd try the above first!
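
If you go that route, the training data is just a plain text file of
labelled tweets and the stock document categorizer does the rest. Something
along these lines should work -- a sketch against a recent OpenNLP doccat
API (class and method signatures have moved around a bit between releases),
where tweets.train is a hypothetical file with one labelled, whitespace-
tokenized tweet per line:

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.doccat.DoccatFactory;
    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.doccat.DocumentSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class NoisyNegativeTrainer {
        public static void main(String[] args) throws Exception {
            // tweets.train (hypothetical): label first, then the tokenized text, e.g.
            //   sarcastic oh great another monday
            //   other just landed , lovely weather out here
            // The "other" lines are randomly sampled non-#sarcasm tweets, i.e. noisy negatives.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("tweets.train")),
                    StandardCharsets.UTF_8);
            ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

            DoccatModel model = DocumentCategorizerME.train(
                    "en", samples, TrainingParameters.defaultParams(), new DoccatFactory());

            DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
            double[] probs = categorizer.categorize(
                    new String[]{"oh", "great", "another", "monday"});
            System.out.println(categorizer.getBestCategory(probs));
        }
    }

Because the negatives are noisy, it may be better to threshold on the
probability of the 'sarcastic' outcome rather than just taking the best
category.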

-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

RE: Text Categorization With Only One Category

Posted by Wei Liu <wl...@fizzback.com>.
Hi,

I think your task is what is called 'one-class classification'; a typical scenario is spam detection,
where you can safely model spam messages but cannot model 'not spam'.

You might want to look into some outlier detection techniques. I am no expert on them, but I've read
that they can be quite successful.

Hope it helps.
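
For example, one cheap way to get an outlier score with no negative data at
all is to fit a smoothed unigram model on the positive tweets and look at
how probable a new document is under it: anything with an unusually low
average per-token log-probability looks like an outlier. A rough sketch in
plain Java -- the class name and toy data are invented, and a real cutoff
would be set from held-out positive examples:

    import java.util.*;

    // Sketch: fit an add-one smoothed unigram model on the positive documents and
    // treat a low average per-token log-probability as "outlier", i.e. probably
    // not a member of the positive class.
    public class UnigramOutlierScorer {

        private final Map<String, Integer> counts = new HashMap<>();
        private int totalTokens = 0;

        public UnigramOutlierScorer(List<String> positiveDocs) {
            for (String doc : positiveDocs) {
                for (String tok : doc.toLowerCase().split("\\s+")) {
                    counts.merge(tok, 1, Integer::sum);
                    totalTokens++;
                }
            }
        }

        // Average per-token log-probability under the positive-class unigram model.
        public double avgLogProb(String doc) {
            String[] toks = doc.toLowerCase().split("\\s+");
            int vocab = counts.size() + 1;  // +1 pseudo-type for unseen tokens
            double logProb = 0.0;
            for (String tok : toks) {
                int c = counts.getOrDefault(tok, 0);
                logProb += Math.log((c + 1.0) / (totalTokens + vocab));  // add-one smoothing
            }
            return logProb / toks.length;
        }

        public static void main(String[] args) {
            UnigramOutlierScorer scorer = new UnigramOutlierScorer(Arrays.asList(
                    "oh great another monday #sarcasm",
                    "i just love waiting in line for hours #sarcasm"));
            // Higher = more like the positive set; a cutoff (e.g. the 5th percentile
            // score on held-out positives) would turn this into a yes/no decision.
            System.out.println(scorer.avgLogProb("love waiting in line"));
            System.out.println(scorer.avgLogProb("the cat sat on the mat"));
        }
    }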

