Posted to users@opennlp.apache.org by "viraf.bankwalla@yahoo.com.INVALID" <vi...@yahoo.com.INVALID> on 2019/07/03 14:22:11 UTC

Document Classification with imbalanced data

I am trying document classification using OpenNLP; however, my data is highly unbalanced (the majority class is 97%). I recognize that I could randomly over- or under-sample the data set, and I am reading up on SMOTE and ADASYN (though I am not sure how to apply these to OpenNLP).
Any suggestions on dealing with the highly unbalanced data would be appreciated.
Thanks
- viraf
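
Editorial sketch: the random oversampling mentioned above can be done as a preprocessing step before training, without touching OpenNLP itself. The snippet assumes the training file uses the doccat layout of one document per line, with the category as the first whitespace-separated token; the function name `oversample` and its parameters are illustrative and not part of any library. It only duplicates minority-class lines and never drops majority-class lines.

```python
import random
from collections import Counter, defaultdict


def oversample(lines, target_per_class=None, seed=0):
    """Duplicate minority-class lines at random until each class has at
    least `target_per_class` samples (default: size of the largest class).

    Assumes each line looks like '<label> <document text>'.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label = line.split(None, 1)[0]
        by_label[label].append(line)

    target = target_per_class or max(len(v) for v in by_label.values())
    out = []
    for label, samples in by_label.items():
        out.extend(samples)  # keep every original sample
        extra = target - len(samples)
        if extra > 0:
            # sample with replacement from the same class
            out.extend(rng.choice(samples) for _ in range(extra))
    rng.shuffle(out)
    return out


if __name__ == "__main__":
    toy = ["C1 invoice total due"] * 6 + ["C2 lease agreement", "C3 tax form"]
    balanced = oversample(toy)
    print(Counter(line.split()[0] for line in balanced))
```

Oversampling should be applied to the training split only, so that the evaluation data keeps the real class distribution. With a 97% majority class, full balancing multiplies the data size several times over, so a smaller `target_per_class` may be a more practical starting point.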

Re: Document Classification with imbalanced data

Posted by Tommaso Teofili <to...@gmail.com>.

Or you could hook into the training step and give the very rare classes a
higher weight than the common class, so that occurrences of a rare class
have a greater impact on the model parameters/weights.

Regards,
Tommaso
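
Editorial sketch: one common way to choose such weights is inverse class frequency, w_c = N / (K * n_c), where N is the total number of samples, K the number of classes, and n_c the count for class c. The snippet below applies that formula to the class counts posted in this thread. How the weights would be wired into OpenNLP's trainer is left open here; the function name is illustrative only.

```python
# Class counts as posted in this thread.
COUNTS = {"C1": 131613, "C2": 873, "C3": 830, "C4": 492, "C5": 456, "C6": 430}


def inverse_frequency_weights(counts):
    """Return w_c = N / (K * n_c) for every class c.

    With these weights each class contributes the same total weight
    (N / K) to the training objective.
    """
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


if __name__ == "__main__":
    for label, w in inverse_frequency_weights(COUNTS).items():
        print(f"{label}: {w:.3f}")
```

Weights this far apart can make training unstable, so damping them (for example, taking the square root) is a frequent compromise; whether that helps here would need to be checked on held-out data.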

On Wed, 3 Jul 2019 at 17:51, Dan Russ <da...@gmail.com> wrote:

> You may have to run one class at a time and find a way to resolve cases
> where more than 1 class wants a document.
> Daniel

Re: Document Classification with imbalanced data

Posted by Dan Russ <da...@gmail.com>.

You may have to run one class at a time and find a way to resolve cases where more than 1 class wants a document.
Daniel
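
Editorial sketch: one simple way to resolve the case where more than one per-class model claims a document is to keep the claim with the highest score and fall back to a default label when no model claims it. The snippet assumes each per-class (one-vs-rest) model yields a score in [0, 1]; the `resolve` function, the 0.5 threshold, and the fallback to the majority class C1 are all assumptions made for illustration.

```python
def resolve(scores, threshold=0.5, default="C1"):
    """Pick one label from per-class binary scores.

    scores    -- dict mapping label -> score from that label's binary model
    threshold -- minimum score for a model to "claim" the document
    default   -- label returned when no model claims the document
    """
    claimed = {label: s for label, s in scores.items() if s >= threshold}
    if not claimed:
        return default
    # several models may claim the document; keep the most confident one
    return max(claimed, key=claimed.get)


if __name__ == "__main__":
    print(resolve({"C2": 0.7, "C3": 0.9, "C4": 0.1}))
    print(resolve({"C2": 0.2, "C3": 0.1}))
```

Scores from independently trained binary models are not necessarily comparable, so some form of calibration on held-out data may be needed before taking the maximum.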

> On Jul 3, 2019, at 11:49 AM, viraf.bankwalla@yahoo.com.INVALID wrote:
> 
> Thanks. I am unfamiliar with the approaches that you mentioned and will investigate. I forgot to mention that this is a multi-class classification problem. Each sample represents one page from a corpus of documents that have been scanned and had their text extracted using OCR (so the text is noisy).
>
> Label | Samples |     %
> ------+---------+------
> C1    |  131613 | 97.71
> C2    |     873 |  0.65
> C3    |     830 |  0.62
> C4    |     492 |  0.37
> C5    |     456 |  0.34
> C6    |     430 |  0.32
>
> - viraf

Re: Document Classification with imbalanced data

Posted by "viraf.bankwalla@yahoo.com.INVALID" <vi...@yahoo.com.INVALID>.

Thanks. I am unfamiliar with the approaches that you mentioned and will investigate. I forgot to mention that this is a multi-class classification problem. Each sample represents one page from a corpus of documents that have been scanned and had their text extracted using OCR (so the text is noisy).

Label | Samples |     %
------+---------+------
C1    |  131613 | 97.71
C2    |     873 |  0.65
C3    |     830 |  0.62
C4    |     492 |  0.37
C5    |     456 |  0.34
C6    |     430 |  0.32

- viraf

    On Wednesday, July 3, 2019, 10:31:44 AM EDT, Dan Russ <da...@gmail.com> wrote:  

Have you considered using outlier detection methods? I’m not really an expert on this, but maybe you can define your majority class very well and treat the other class as the outlier. Another option may be one-class classification (https://en.wikipedia.org/wiki/One-class_classification); SVDD is an example of this. Finally, you might want to look at data augmentation techniques. I am in the middle of some work using conditional GANs, but it is not working out so well for me at the moment.

Let me know if any of these work out for you.
Daniel


Re: Document Classification with imbalanced data

Posted by Dan Russ <da...@gmail.com>.

Have you considered using outlier detection methods? I’m not really an expert on this, but maybe you can define your majority class very well and treat the other class as the outlier. Another option may be one-class classification (https://en.wikipedia.org/wiki/One-class_classification); SVDD is an example of this. Finally, you might want to look at data augmentation techniques. I am in the middle of some work using conditional GANs, but it is not working out so well for me at the moment.

Let me know if any of these work out for you.
Daniel
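
Editorial sketch: the idea of describing the majority class well and treating everything else as an outlier can be illustrated with a very small stand-in. This is not SVDD; it builds a bag-of-words centroid from majority-class documents and flags any document whose cosine similarity to that centroid falls below a threshold. The function names, the whitespace tokenization, and the 0.2 threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased whitespace tokens as a term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def fit_centroid(majority_docs):
    """Sum the term counts of all majority-class documents."""
    centroid = Counter()
    for doc in majority_docs:
        centroid.update(bow(doc))
    return centroid


def is_outlier(doc, centroid, min_similarity=0.2):
    """True when the document looks unlike the majority class."""
    return cosine(bow(doc), centroid) < min_similarity


if __name__ == "__main__":
    majority = [
        "invoice total amount due",
        "invoice amount paid total",
        "total due invoice number",
    ]
    centroid = fit_centroid(majority)
    print(is_outlier("invoice total due", centroid))
    print(is_outlier("lease agreement tenant landlord", centroid))
```

On noisy OCR text, raw token counts are likely to be a weak representation; character n-grams or TF-IDF weighting might hold up better, and the threshold would need tuning on labelled held-out pages.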

> On Jul 3, 2019, at 10:22 AM, viraf.bankwalla@yahoo.com.INVALID wrote:
> 
> I am trying document classification using OpenNLP; however, my data is highly unbalanced (the majority class is 97%). I recognize that I could randomly over- or under-sample the data set, and I am reading up on SMOTE and ADASYN (though I am not sure how to apply these to OpenNLP).