Posted to users@opennlp.apache.org by Chris Yocum <cy...@gmail.com> on 2011/09/26 14:37:49 UTC
Middle Irish and NLP
Hello Everyone,
I am working with a student at my university on using NLP techniques in
document categorisation in late Middle Irish. I am a coder and I know
Java so that won't be a problem. We are building a corpus at the moment.
We are working on a specific author and what we would like to do is see
if a particular poem/text is his or not based on NLP. What I was
thinking is we would need a few things:
1) a corpus of Middle Irish texts of the same general linguistic range
(we are working on that at the moment). Is there any
documentation/knowledge on how to create this (or is this just training
the POS tagger)?
2) Train a model
3) Pass that model to the document categoriser, along with the
categories we want (his, not his, and unsure).
A few other miscellaneous questions: will we need to put part of speech
tags in the corpus to create the model?
Thanks in advance!
Chris Yocum
Re: Middle Irish and NLP
Posted by Chris Yocum <cy...@gmail.com>.
Fantastic! Thanks for the info!
Chris
On 26/09/11 14:48, Jason Baldridge wrote:
> Before you do POS tagging and such, you should probably get set up with
> word-based indicators of authorship, like type-token ratios, average word
> length, frequent unigrams and bigrams and so on. Then you just need the
> text, so no annotation or model training is necessary. Usually
> dimensionality reduction techniques like PCA are good in this context too.
>
> If you haven't already, you should check out Patrick Juola's page:
>
> http://www.mathcs.duq.edu/~juola/
>
> And especially his book on authorship attribution.
>
> If you do want to build POS taggers, there are some useful instructions
> here:
>
> http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
>
> Jason
Re: Middle Irish and NLP
Posted by Jörn Kottmann <ko...@gmail.com>.
On 9/26/11 4:02 PM, Jason Baldridge wrote:
> Yes, though I thought some things didn't make it. I could be misremembering.
As far as I know, those are only things which changed anyway, e.g. how to
build, how to download, etc.
The only bigger thing which was not moved is the tutorial about CONLL06,
because we then implemented
our own parsers and the detokenizer based on this tutorial, but we
never rewrote the instructions on how
to use these new tools. We also added new documentation to the docbook only.
Jörn
Re: Middle Irish and NLP
Posted by Jason Baldridge <ja...@gmail.com>.
Yes, though I thought some things didn't make it. I could be misremembering.
On Mon, Sep 26, 2011 at 8:51 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 9/26/11 3:48 PM, Jason Baldridge wrote:
>
>> If you do want to build POS taggers, there are some useful instructions
>> here:
>>
>> http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
>>
>
>
> Didn't we move all our documentation to the docbook?
>
> This one here:
> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html
>
> Jörn
>
Re: Middle Irish and NLP
Posted by Jörn Kottmann <ko...@gmail.com>.
On 9/26/11 3:48 PM, Jason Baldridge wrote:
> If you do want to build POS taggers, there are some useful instructions
> here:
>
> http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
Didn't we move all our documentation to the docbook?
This one here:
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html
Jörn
Re: Middle Irish and NLP
Posted by Jason Baldridge <ja...@gmail.com>.
Before you do POS tagging and such, you should probably get set up with
word-based indicators of authorship, like type-token ratios, average word
length, frequent unigrams and bigrams and so on. Then you just need the
text, so no annotation or model training is necessary. Usually
dimensionality reduction techniques like PCA are good in this context too.
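The word-based indicators above need nothing beyond the raw text. As a rough sketch only (Python standard library, not OpenNLP; the function names `tokenize` and `style_features` and the simple letters-only tokenisation are assumptions made for illustration, not an established method for Middle Irish), they could be computed like this:

```python
import re
import unicodedata
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption: a token is any run of Unicode letters. The text is
    NFC-normalised first so that accented vowels are single code points.
    """
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"[^\W\d_]+", text)


def style_features(text, top_n=5):
    """Compute simple word-based authorship indicators for one text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        # distinct word forms divided by total words
        "type_token_ratio": len(unigrams) / len(tokens),
        # mean number of characters per token
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # most frequent single words and adjacent word pairs
        "top_unigrams": unigrams.most_common(top_n),
        "top_bigrams": bigrams.most_common(top_n),
    }


if __name__ == "__main__":
    # Placeholder English text; substitute one text from the corpus.
    sample = "the cat and the dog and the bird"
    for name, value in style_features(sample, top_n=2).items():
        print(name, value)
```

One caveat with this kind of measure: the type-token ratio tends to fall as a text gets longer, so it is probably best compared across samples of similar length. The per-text numbers could then be collected into a table, one row per text, as input for something like PCA.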
If you haven't already, you should check out Patrick Juola's page:
http://www.mathcs.duq.edu/~juola/
And especially his book on authorship attribution.
If you do want to build POS taggers, there are some useful instructions
here:
http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
Jason