Posted to users@opennlp.apache.org by Florin Langa <fl...@tenforce.com> on 2013/05/29 10:56:11 UTC

Training model files question

Hello everyone!

I have a question. Maybe it is a silly question, but I don't know how to
handle it. I need to build a classifier for CVs. To do this, I assume
that I need to build a model file containing a set of skills. I have a
list of skills, but I don't know how to build the input file. Here is a
sample of my input file:

Tiles and clinkers, setting experience Tile layer .
Silk screen printing Lead typesetter, printing shop .
CTI, computer telephony Alarm operator .
GifBuilder animation program Specialist book writer .
Gardening, study circle leadership Sports centre manager .
........
etc.

The first part, up to the next capital letter, is the skill name and the
second part is the job name.
Ex: "Gardening, study circle leadership" is the skill name and
"Sports centre manager" is the job name.
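
A minimal sketch of the splitting rule described above, using only the Java
standard library. It reads "next capital letter" as "the first later word
that starts with an uppercase letter", which is an assumption needed for
samples such as "CTI, ..." and "GifBuilder ...". The class and method names
are hypothetical and not part of OpenNLP.

```java
import java.util.Arrays;

public class SkillJobSplitter {

    // Splits one sample line into {skill, job}.
    // Heuristic (an assumption, not from the thread): the job name starts at
    // the first token after the first one that begins with an uppercase
    // letter. A trailing "." is dropped before splitting.
    static String[] split(String line) {
        String text = line.trim();
        if (text.endsWith(".")) {
            text = text.substring(0, text.length() - 1).trim();
        }
        String[] tokens = text.split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (Character.isUpperCase(tokens[i].charAt(0))) {
                String skill = String.join(" ", Arrays.copyOfRange(tokens, 0, i));
                String job = String.join(" ", Arrays.copyOfRange(tokens, i, tokens.length));
                return new String[] {skill, job};
            }
        }
        // No later capitalized word found: treat the whole line as the skill.
        return new String[] {text, ""};
    }

    public static void main(String[] args) {
        String[] parts = split("Gardening, study circle leadership Sports centre manager .");
        System.out.println("skill: " + parts[0]);
        System.out.println("job:   " + parts[1]);
    }
}
```

The heuristic breaks as soon as a skill name contains a second capitalized
word, so an explicit separator (for example a tab) between the two parts
would be more robust if the source data can be regenerated.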

To create the actual model file, I use the following command:

opennlp DoccatTrainer -encoding UTF-8 -lang en -data /tmp/jobs.txt -model /tmp/en-language-jobs.bin

Now, my question is whether the input file I am providing to the above
command has the right format.

Also, please note that I was able to create the model file, but when
running the command

opennlp Doccat /tmp/en-language-jobs.bin < /tmp/programmer.txt

the results are 100% irrelevant.

Best regards,
Florin

Re: Training model files question

Posted by Samik Raychaudhuri <sa...@gmail.com>.
If the documents follow certain rules/patterns, and you are looking for
specific words/skills in the resume, wouldn't you be better served by
regular-expression-based classifiers rather than NLP-based ones?
HTH.



RE: Training model files question

Posted by Ian Jackson <Ia...@trilliumsoftware.com>.
The regular expression name finder (opennlp.tools.namefind.RegexNameFinder) treats a sentence as a group of tokens separated by spaces. The regular expressions use Java Pattern syntax. Each token is separated by a space in your regular expression, so the expression would be something like "[Cc]omputer [aA]rchitecture", which handles both upper- and lower-case initial letters.

The DictionaryNameFinder makes a similar attempt to handle multiple tokens.
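
To illustrate the matching idea only, here is a small sketch that joins the
tokens of a sentence with single spaces and applies a Java Pattern to the
result. It uses java.util.regex directly and does not call RegexNameFinder;
the class name, method name and sample tokens are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiTokenRegex {

    // Joins the tokens with single spaces and returns the first match of the
    // pattern in the joined sentence, or null if nothing matches.
    static String findFirst(Pattern pattern, String[] tokens) {
        String sentence = String.join(" ", tokens);
        Matcher matcher = pattern.matcher(sentence);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The space in the pattern stands for the boundary between two tokens.
        Pattern skill = Pattern.compile("[Cc]omputer [aA]rchitecture");
        String[] tokens = {"Experience", "in", "computer", "architecture", "and", "networking"};
        System.out.println(findFirst(skill, tokens));
    }
}
```

Because the pattern is applied to plain joined text here, it could also match
in the middle of a longer word; adding \b word boundaries to the expression
is one way to tighten it.
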
-----Original Message-----
From: Florin Langa [mailto:florin.langa@tenforce.com] 
Sent: Wednesday, May 29, 2013 11:56 AM
To: users; kottmann@gmail.com
Subject: Re: Training model files question

Hello Jorn,

First of all, thank you for your answer. Now I have another question for you: what if my category1 contains multiple words?
For example, let's say that one category is "Computer architecture". As I understand it, only the first token (in this case "Computer") is considered. How can I create a category containing multiple tokens?
In the meantime I will follow your advice and have a look at the name finder as well.

Thank you!

Best regards,
Florin




Re: Training model files question

Posted by Florin Langa <fl...@tenforce.com>.
Hello Jorn,

First of all, thank you for your answer. Now I have another question for
you: what if my category1 contains multiple words?
For example, let's say that one category is "Computer architecture". As I
understand it, only the first token (in this case "Computer") is
considered. How can I create a category containing multiple tokens?
In the meantime I will follow your advice and have a look at the
name finder as well.

Thank you!

Best regards,
Florin



Re: Training model files question

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

not sure I understand what you are trying to do.

The doccat component can assign a category to a text (or a piece of text),
so that will probably work well if you want to assign a category to an
entire CV or a paragraph in it.

If you want to identify skills mentioned inside a CV you might want to use
the name finder instead (have a look at its documentation).

Anyway, the training format for the doccat component is one document per
line, where all the tokens are whitespace-tokenized and the first token in
a line is the category (explained in more detail in the documentation,
with a sample).

like this:
category1 token_a token_b token_c
category2 token_c token_x
....
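
As a rough sketch of producing lines in this shape with the Java standard
library: the helper below puts the category first and the document text
after it on one line. Replacing whitespace inside a multi-word category with
an underscore is an assumption made here so that the category stays a single
token; it is not something stated in this thread. The class and method names
and the sample text are hypothetical.

```java
public class DoccatLineBuilder {

    // Builds one training line: the category token first, then the document.
    // Whitespace inside the category is replaced with "_" so that it remains
    // a single whitespace-delimited token (an assumption, see the lead-in).
    // Whitespace runs in the text, including line breaks, are collapsed so
    // the document stays on one line.
    static String toTrainingLine(String category, String text) {
        String categoryToken = category.trim().replaceAll("\\s+", "_");
        String document = text.trim().replaceAll("\\s+", " ");
        return categoryToken + " " + document;
    }

    public static void main(String[] args) {
        System.out.println(toTrainingLine("Computer architecture",
                "designed cache hierarchies and instruction pipelines"));
    }
}
```

Running main prints a line that starts with the token Computer_architecture,
followed by the document text.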

To do some testing, you should have at least a hundred lines in your
training file.

HTH,
Jörn
