Posted to users@opennlp.apache.org by Chris Yocum <cy...@gmail.com> on 2011/09/26 14:37:49 UTC

Middle Irish and NLP

Hello Everyone,

I am working with a student at my university on using NLP techniques in
document categorisation in late Middle Irish.  I am a coder and I know
Java so that won't be a problem.  We are building a corpus at the moment.

We are working on a specific author and what we would like to do is see
if a particular poem/text is his or not based on NLP.  What I was
thinking is we would need a few things:

1) a corpus of Middle Irish texts of the same general linguistic range
(we are working on that at the moment).  Is there any
documentation/knowledge on how to create this (or is this just training
the POS tagger)?

2) train a model

3) pass that model to the document categoriser, together with the
categories we want (his, not his, and unsure).

A few other miscellaneous questions: will we need to put part of speech
tags in the corpus to create the model?

Thanks in advance!
Chris Yocum

Re: Middle Irish and NLP

Posted by Chris Yocum <cy...@gmail.com>.

Fantastic! Thanks for the info!

Chris

On 26/09/11 14:48, Jason Baldridge wrote:
> Before you do POS tagging and such, you should probably get set up with
> word-based indicators of authorship, like type-token ratios, average word
> length, frequent unigrams and bigrams and so on. Then you just need the
> text, so no annotation or model training is necessary. Usually
> dimensionality reduction techniques like PCA are good in this context too.
> 
> If you haven't already, you should check out Patrick Juola's page:
> 
> http://www.mathcs.duq.edu/~juola/
> 
> And especially his book on authorship attribution.
> 
> If you do want to build POS taggers, there are some useful instructions
> here:
> 
> http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
> 
> Jason

Re: Middle Irish and NLP

Posted by Jörn Kottmann <ko...@gmail.com>.

On 9/26/11 4:02 PM, Jason Baldridge wrote:
> Yes, though I thought some things didn't make it. I could be misremembering.

As far as I know, the only things left out are ones which changed anyway,
e.g. how to build, how to download, etc.
The only bigger thing which was not moved is the tutorial about CONLL06,
because we then implemented our own parsers and the detokenizer based on
this tutorial, but never rewrote the instructions on how to use these new
tools. We also added new documentation to the docbook only.

Jörn

Re: Middle Irish and NLP

Posted by Jason Baldridge <ja...@gmail.com>.

Yes, though I thought some things didn't make it. I could be misremembering.

On Mon, Sep 26, 2011 at 8:51 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 9/26/11 3:48 PM, Jason Baldridge wrote:
>
>> If you do want to build POS taggers, there are some useful instructions
>> here:
>>
>> http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
>>
>
>
> Didn't we move all our documentation to the docbook?
>
> This one here:
> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html
>
> Jörn
>

Re: Middle Irish and NLP

Posted by Jörn Kottmann <ko...@gmail.com>.

On 9/26/11 3:48 PM, Jason Baldridge wrote:
> If you do want to build POS taggers, there are some useful instructions
> here:
>
> http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page


Didn't we move all our documentation to the docbook?

This one here:
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html

Jörn

Re: Middle Irish and NLP

Posted by Jason Baldridge <ja...@gmail.com>.

Before you do POS tagging and such, you should probably get set up with
word-based indicators of authorship, like type-token ratios, average word
length, frequent unigrams and bigrams and so on. Then you just need the
text, so no annotation or model training is necessary. Usually
dimensionality reduction techniques like PCA are good in this context too.
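
As a rough illustration of the word-based indicators listed above, here is
a minimal Java sketch that computes a type-token ratio, average word length,
and unigram/bigram counts from raw text. The class name StyleFeatures, its
method names, and the regex tokenizer are all assumptions made for this
example; none of them come from OpenNLP, and a real Middle Irish corpus
would probably need a tokenizer tuned to its orthography. PCA is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class for this sketch; not part of OpenNLP.
class StyleFeatures {

    // Assumed tokenizer: lowercase, then split on anything that is not a
    // Unicode letter or combining mark (so accented letters stay intact).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{M}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Number of distinct word forms divided by the total number of tokens.
    static double typeTokenRatio(List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        return (double) new HashSet<>(tokens).size() / tokens.size();
    }

    // Mean token length, measured in Java chars (UTF-16 code units).
    static double averageWordLength(List<String> tokens) {
        return tokens.stream().mapToInt(String::length).average().orElse(0.0);
    }

    // Counts of n-grams; n = 1 gives unigrams, n = 2 gives bigrams.
    static Map<String, Integer> ngramCounts(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    // The k most frequent n-grams; ties are broken alphabetically.
    static List<String> mostFrequent(Map<String, Integer> counts, int k) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder text; substitute the contents of a corpus file.
        String text = "The cat sat on the mat. The cat slept!";
        List<String> tokens = tokenize(text);
        System.out.println("type-token ratio: " + typeTokenRatio(tokens));
        System.out.println("average word length: " + averageWordLength(tokens));
        System.out.println("top unigrams: " + mostFrequent(ngramCounts(tokens, 1), 3));
        System.out.println("top bigrams: " + mostFrequent(ngramCounts(tokens, 2), 3));
    }
}
```

Computing these numbers per text gives one feature vector per document,
which can then be compared between texts known to be by the author and the
disputed ones. Note that type-token ratio is sensitive to text length, so
it is safer to compare samples of similar size.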

If you haven't already, you should check out Patrick Juola's page:

http://www.mathcs.duq.edu/~juola/

And especially his book on authorship attribution.

If you do want to build POS taggers, there are some useful instructions
here:

http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page

Jason