Posted to dev@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2016/08/25 13:55:04 UTC

Is sentence detection process really needed?

Hello everybody!

Could someone explain why I should separate my documents into sentences
to train my models?
My documents are resumes/CVs, and their sentences can look very different.
For example, a sentence could also be:

1. Name: John
2. Surname: travolta

Etc.
So my question is: what is the problem if I train my models
(namefinder, tokenizer) with one complete resume/CV per line?

Could it be a problem?
In that case, when I want to tokenize a resume and do NER, I
would simply pass the complete resume text, skipping the "sentence detection"
step.

Thanks for your opinion in advance!

Best
Damiano

Re: Is sentence detection process really needed?

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
You don’t want to train a POS tagger specifically for your data. The POSTagger is trained on sentences. I don’t think you need to worry about “the”. If you want to treat a line as a sentence, it may work fine.

My_PRP$ name_NN is_VBZ Damiano._NNP My_PRP$ surname_NN is_VBZ Porta_NNP
My_PRP$ name_NN is_VBZ Damiano._NNP
My_PRP$ surname_NN is_VBZ Porta_NNP

Notice the period attached to your name (Damiano._NNP). If you add a space before the periods that mark the end of a sentence, then you are actually doing sentence detection...

My_PRP$ name_NN is_VBZ Damiano_NNP ._. My_PRP$ surname_NN is_VBZ Porta_NNP
My_PRP$ name_NN is_VBZ Damiano_NNP ._.
My_PRP$ surname_NN is_VBZ Porta_NNP

But yes, the tags for these simple sentences are the same.
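
A minimal sketch of the tokenization point above, in plain Java with no OpenNLP dependency. The class and method names are hypothetical, and the period handling is deliberately naive (it would also split abbreviations such as "Ph.D."), so treat it as an illustration rather than a replacement for a trained tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
final class PeriodTokenizerSketch {

    /** Naive whitespace split: a sentence-final period stays glued to the word. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Same split, but a trailing period becomes a token of its own. */
    static List<String> splitTrailingPeriod(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : whitespaceTokens(text)) {
            if (t.length() > 1 && t.endsWith(".")) {
                tokens.add(t.substring(0, t.length() - 1));
                tokens.add(".");
            } else {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

With the first method the tagger would see the single token "Damiano."; with the second it would see "Damiano" followed by ".", which is the difference between the two tagged examples above.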

Daniel Russ, Ph.D.
Staff Scientist, Office of Intramural Research
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda,  MD 20892-5624

On Aug 26, 2016, at 2:49 PM, Damiano Porta <da...@gmail.com> wrote:

But I think it is the same, no? I mean, I will pass all the content as one
sentence. So in this case the word "the" will be tagged the same.

The problem in this case is that I need to create a tagger model too...

On 26 Aug 2016 at 20:14, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote:

The POSTaggerME uses tokenized sentences. In your example, both cases have
2 sentences: sentence 1 = "My name is Damiano." and sentence 2 = "My surname is
Porta."

POSTaggerME tagger = …;
tagger.tag(new String[]{"My", "name", "is", "Damiano"});
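
As a rough illustration of "tokenized sentences", the sketch below groups an already tokenized document into one String[] per sentence, which is the shape of input the tag call above receives. It is plain Java with hypothetical names, and it assumes sentence-final punctuation has already been split into its own token; a real pipeline would use a trained sentence detector instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
final class SentenceGroupingSketch {

    /** Closes a sentence after every ".", "!" or "?" token. */
    static List<String[]> groupIntoSentences(List<String> tokens) {
        List<String[]> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                sentences.add(current.toArray(new String[0]));
                current.clear();
            }
        }
        // Keep a trailing sentence that has no final punctuation.
        if (!current.isEmpty()) {
            sentences.add(current.toArray(new String[0]));
        }
        return sentences;
    }
}
```

Each returned array could then be passed to the tagger one at a time, rather than passing the whole document as a single array.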

On Aug 26, 2016, at 1:46 PM, Damiano Porta <da...@gmail.com> wrote:

Hmmm, why?
If I use the POS tagger on:
"My name is Damiano. My surname is Porta"

OR separate:

My name is Damiano.
My surname is Porta.

I think the tags will be the same, no?

On 26 Aug 2016 at 18:24, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote:

If you want to use the part of speech (from the POSTaggerME) as a feature,
you will need sentences.

On Aug 26, 2016, at 12:15 PM, Damiano Porta <da...@gmail.com> wrote:

Thanks Joern!
If I have understood you correctly...
if I do not need relations between sentences, I can skip sentence
detection, right?

On 26 Aug 2016 at 16:33, "Joern Kottmann" <ko...@gmail.com> wrote:

The name finder has the concept of "adaptive data" in the feature
generation. The feature generators can remember things from previous
sentences and use them to generate features. Usually that can
help with the recognition rate if you have names that are repeated. You
can tweak this to your data, or just pass in the entire document.
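
To make the "adaptive data" idea concrete, here is a small plain-Java sketch of a feature that remembers how a token was labelled earlier in the same document. The class is hypothetical and is not OpenNLP's implementation; in OpenNLP itself the reset step corresponds to calling clearAdaptiveData() on the NameFinderME between documents.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of document-level memory, not OpenNLP code.
final class PreviousOutcomeMemorySketch {

    private final Map<String, String> lastOutcomeByToken = new HashMap<>();

    /** Feature string for a token: the label it received earlier in this document, if any. */
    String feature(String token) {
        String previous = lastOutcomeByToken.get(token);
        return "prev=" + (previous == null ? "none" : previous);
    }

    /** Call after a sentence has been labelled, so later sentences can use the result. */
    void update(String[] tokens, String[] outcomes) {
        for (int i = 0; i < tokens.length; i++) {
            lastOutcomeByToken.put(tokens[i], outcomes[i]);
        }
    }

    /** Call at a document boundary so one resume does not influence the next. */
    void clear() {
        lastOutcomeByToken.clear();
    }
}
```

If the whole resume is passed as a single "sentence", there are no earlier sentences to remember, so a feature of this kind never fires within that document.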

Jörn

On Fri, Aug 26, 2016 at 3:25 PM, Damiano Porta <da...@gmail.com> wrote:

Hi!
Yes, I can train a good model (sure, it will take a lot of time); I have 30k
resumes, so the "data" isn't a problem.
I have thought about many things. I am also creating a custom feature
generator, with a dictionary (for names) and regexes for the birthday; then
the machine learning will look at their contexts.
So now I need to separate the sentences to create a custom model.
At this point I will not try the one-CV-per-line approach.

On 26 Aug 2016 at 15:10, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote:

Hi Damiano,
I am not sure that the NameFinder will be effective as-is for you. Do
you have training data (and I mean a lot of training data)? You need to
consider which features are useful in your case. You might consider a
feature such as the line number on the page (since people tend to put their
name on the first or second line), or maybe the font size. You can add a
dictionary of common names and have a feature “inDictionary”. You will have
to use your domain knowledge to help you here.
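
A sketch of what such domain features might look like as plain strings, assuming a 1-based line number and a lowercase name dictionary are available. The class, the feature names ("w=", "line=top") and the cut-off of two lines are illustrative assumptions; wiring this into an OpenNLP feature generator is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical feature extractor for resume tokens, not part of OpenNLP.
final class ResumeFeatureSketch {

    static List<String> features(String token, int lineNumber, Set<String> nameDictionary) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> features = new ArrayList<>();
        features.add("w=" + lower);
        // Bucket the line number so the model learns "near the top" rather than exact positions.
        features.add(lineNumber <= 2 ? "line=top" : "line=other");
        if (nameDictionary.contains(lower)) {
            features.add("inDictionary");
        }
        return features;
    }
}
```

For example, "John" on line 1 with "john" in the dictionary yields the three features w=john, line=top and inDictionary.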

For birthday you may want to consider using a regex to pick out dates.
Then look at the context around the date (the words before/after; discard it
if “graduated” is nearby or if another date comes just before), or maybe at
the number of years before the present year (if you are looking at resumes,
you probably won’t find any 5-year-olds or 200-year-olds).
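
The regex-plus-plausibility idea can be sketched as follows. The pattern only covers dd/mm/yyyy and dd-mm-yyyy style dates, and the age bounds are parameters chosen by the caller; both are assumptions for illustration, and the context check on surrounding words is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
final class BirthDateCandidateSketch {

    // Day, month and four-digit year separated by '/' or '-'; other formats need more patterns.
    private static final Pattern DATE =
            Pattern.compile("\\b(\\d{1,2})[/-](\\d{1,2})[/-](\\d{4})\\b");

    /** Returns the dates whose year implies an age between minAge and maxAge (inclusive). */
    static List<String> plausibleBirthDates(String text, int currentYear, int minAge, int maxAge) {
        List<String> candidates = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            int age = currentYear - Integer.parseInt(m.group(3));
            if (age >= minAge && age <= maxAge) {
                candidates.add(m.group());
            }
        }
        return candidates;
    }
}
```

The surviving candidates could then be marked with a feature for the model instead of being accepted outright, so the context around each date still decides.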

On Aug 26, 2016, at 5:57 AM, Damiano Porta <da...@gmail.com> wrote:

Hi Daniel!

Thank you so much for your opinion.
It makes perfect sense. But I am still a bit confused about the length of
the sentences.
A resume contains many names, dates, etc. So my doubt concerns
the structure of the sentences, because they sometimes follow specific
patterns.

For example, I need to extract the personal name (of whoever wrote the
resume), the birthday, etc.

As you know, there are many names and dates inside a resume, so I thought
about writing the entire resume as one sentence, to also train the
approximate "position" of the entities. If I "decompose" the whole resume
into sentences, I will lose this information, no?

Damiano

On 25 Aug 2016 at 16:26, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote:

Hi Damiano,

 Everyone can feel free to correct my ignorance, but I view the
name finder as follows.

 I look at it as walking down the sentence and classifying words as
“NOT IN NAME” until I hit the start of a name, then it is “START NAME”,
followed by “STILL IN NAME” until “NOT IN NAME”. Take the sentence “Did
John eat the stew”. Starting with the first word in the sentence, decide
what the odds are that it starts a person’s name (given that it is the
first word in the sentence, happens to be “Did”, and has a capital but is
not all caps). Then go to the next word in the sentence.
If the first word was not in a name, what are the odds that the second word
starts a name (given that the previous word did not start a name, the word
starts with a capital (but is not all capitals), the word is “John”, and the
previous word is “Did”)? If it decides that we are starting a name at
“John”, we are now looking for the end. What are the odds that “eat” is
part of the name given that [“Did”: was not part of the name, was
capitalized] and that [“John”: was the first word in the name, was
capitalized]? You are essentially classifying [Did <- OTHER] [John
<- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER]. If it was “Did John
Smith eat the stew”, you would have [Did <- OTHER] [John
<- START] [Smith <- IN] [eat <- OTHER] [the <- OTHER] [stew <- OTHER]. There
are features other than just the word, the previous word, and the shape
(first letter capitalized, all letters capitalized). I think the name finder
uses part of speech also.
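
The labelling scheme described above can be written down as a small encoding step: given the token positions of the names in a training sentence, produce one label per token. This is a plain-Java sketch with hypothetical names; the label strings follow the message, not OpenNLP's internal outcome names.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP.
final class NameLabelSketch {

    /**
     * Labels each token as START, IN or OTHER.
     * Each span is a {start, end} pair of token indexes, with end exclusive.
     */
    static String[] label(String[] tokens, int[][] nameSpans) {
        String[] labels = new String[tokens.length];
        Arrays.fill(labels, "OTHER");
        for (int[] span : nameSpans) {
            labels[span[0]] = "START";
            for (int i = span[0] + 1; i < span[1]; i++) {
                labels[i] = "IN";
            }
        }
        return labels;
    }
}
```

For “Did John Smith eat the stew” with a name covering tokens 1 and 2, this gives OTHER, START, IN, OTHER, OTHER, OTHER, matching the second example in the message.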


So you see that it is not a name lookup table, but depends on the
classification of words earlier in the sentence. Therefore, you
must have sentences. Does that help?
Daniel


Re: Is sentence detection process really needed?

Posted by Damiano Porta <da...@gmail.com>.
Pardon, I meant the word "my"...

On 26 Aug 2016 at 20:49, "Damiano Porta" <da...@gmail.com> wrote:

> But I think it is the same, no? I mean, I will pass all the content as
> one sentence. So in this case the word "the" will be tagged the same.
>
> The problem in this case is that I need to create a tagger model too...

Re: Is sentence detection process really needed?

Posted by Damiano Porta <da...@gmail.com>.
But I think it is the same, no? I mean, I will pass all the content as one
sentence. So in this case the word "the" will be tagged the same.

The problem in this case is that I need to create a tagger model too...

On 26 Aug 2016 at 20:14, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote:

> The POSTaggerME uses tokenized sentences. In your example, both cases have
> 2 sentences: sentence 1 = "My name is Damiano." and sentence 2 = "My surname is
> Porta."
>
> POSTaggerME tagger = …;
> tagger.tag(new String[]{"My", "name", "is", "Damiano"});

Re: Is sentence detection process really needed?

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
The POSTaggerME uses tokenized sentences. In your example, both cases have 2 sentences: sentence 1 = "My name is Damiano." and sentence 2 = "My surname is Porta."

POSTaggerME tagger = …;
tagger.tag(new String[]{"My", "name", "is", "Damiano"});

On Aug 26, 2016, at 1:46 PM, Damiano Porta <da...@gmail.com> wrote:

Hmmm, why?
If I use the POS tagger on:
"My name is Damiano. My surname is Porta"

OR separate:

My name is Damiano.
My surname is Porta.

I think the tags will be the same, no?

Il 26/Ago/2016 18:24, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>> ha
scritto:

If you want to use the part of speech (from the POSTaggerME) as a feature,
you will need sentences.

Daniel Russ, Ph.D.
Staff Scientist, Office of Intramural Research
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda,  MD 20892-5624

On Aug 26, 2016, at 12:15 PM, Damiano Porta <da...@gmail.com><mailto:
damianoporta@gmail.com<ma...@gmail.com>>> wrote:

Thanks Joern!
If i have understood you correctly ...
IF i do not need relation between sentences i can skip the sentences
detection right?

Il 26/Ago/2016 16:33, "Joern Kottmann" <ko...@gmail.com><mailto:kot
tmann@gmail.com<ma...@gmail.com>>> ha scritto:

The name finder has the concept of "adaptive data" in the feature
generation. The feature generators can remember things from previous
sentences and use it to generate features based on it. Usually that can
help with the recognition rate if you have names that are repeated.  You
can tweak this to your data, or just pass in the entire document.

Jörn

On Fri, Aug 26, 2016 at 3:25 PM, Damiano Porta <da...@gmail.com><
mailto:damianoporta@gmail.com>>
wrote:

Hi!
Yes I can train a good model (sure It will takes a lot of time), i have
30k
resumes. So the "data" isnt a problem.
I thought about many things, i am also creating a custom features
generator, with dictionary too (for names) and regex for Birthday,  then
the machine learning will look at their contexts.
So now i need to separate the sentences to create a custom model.
At this point i will not try with one per line CV.

Il 26/Ago/2016 15:10, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>
<ma...@mail.nih.gov>>
ha
scritto:

Hi Damiano,
 I am not sure that the NameFinder will be effective as-is for you.  Do
you have training data (and I mean a lot of training data)?  You need to
consider what feature are useful in your case.  You might consider a
feature such as line number on the page (since people tend to put their
name on the top or second line), maybe the font-size.  You can add a
dictionary of common names and have a feature “inDictionary”. You will
have
to use your domain knowledge to help you here.

For birthday you may want to consider using regex to pick out dates.
Then look at the context around the date (words before/after, remove
graduated or if another date just before) or maybe years before present
year (if you are looking at resumes, you probably won’t find any 5 year
olds or 200 year olds.

Daniel Russ, Ph.D.
Staff Scientist, Office of Intramural Research
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda,  MD 20892-5624

On Aug 26, 2016, at 5:57 AM, Damiano Porta <da...@gmail.com><mailto:
damianoporta@gmail.com<ma...@gmail.com>><
mailto:
damianoporta@gmail.com<ma...@gmail.com>>> wrote:

Hi Daniel!

Thank you so much for your opinion.
It makes perfectly sense. But i am still a bit confused about the length
of
the sentences.
In a resume there are many names, dates etc etc. So my doubt is regarding
the structure of the sentences because they follow specific patterns
sometimes.

For example i need to extract the personal name, (Who wrote the resume)
the
Birthday etc etc.

As You know there are many names and dates inside a resume so i thought
about to write the entire resume as sentence to also train the "position"
less or more of the entities. If i "decompose" all the resume into
sentences i will lose this information. No?

Damiano

Il 25/Ago/2016 16:26, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>
<ma...@mail.nih.gov>
<ma...@mail.nih.gov>> ha
scritto:

Hi Damiano,

  Everyone can feel free to correct my ignorance, but I view the name
finder as follows.

  I look at it as walking down the sentence and classifying words as
“NOT IN NAME” until I hit the start of a name, then it is “START NAME”,
followed by “STILL IN NAME” until “NOT IN NAME” again.  Take the sentence
“Did John eat the stew”.  Starting with the first word in the sentence,
decide what the odds are that it starts a person’s name (given that it is
the first word in the sentence, happens to be “Did”, and has an initial
capital but is not all caps).  Then go to the next word in the sentence.
If the first word was not in a name, what are the odds that the second word
starts a name (given that the previous word did not start a name, the word
starts with a capital (but is not all capitals), the word is “John”, and the
previous word is “Did”)?  If it decides that we are starting a name at
“John”, we are now looking for the end.  What are the odds that “eat” is
part of the name, given that [“Did”: was not part of the name, was
capitalized] and that [“John”: was the first word in the name, was
capitalized]?  You are essentially classifying [Did<-OTHER] [John<-START]
[eat<-OTHER] [the<-OTHER] [stew<-OTHER].  If it was “Did John Smith eat the
stew”, you would have [Did<-OTHER] [John<-START] [Smith<-IN] [eat<-OTHER]
[the<-OTHER] [stew<-OTHER].  There are features other than just the word,
the previous word, and the shape (first letter capitalized, all letters
capitalized).  I think the name finder uses part of speech also.

  So you see that it is not a name lookup table, but depends on the
classification of words earlier in the sentence.  Therefore, you
must have sentences. Does that help?
Daniel
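
The walk described above can be sketched in a few lines of Python. This is only an illustration of the idea, not OpenNLP's implementation: the feature names, the toy_classifier rule and the tag_sentence helper are all made up for this example, and a real name finder would use a trained statistical model where this sketch uses a hand-written rule.

```python
def shape(word):
    """Coarse word shape: all caps, initial capital, or other."""
    if word.isupper():
        return "ALLCAPS"
    if word[:1].isupper():
        return "INITCAP"
    return "LOWER"


def features(words, i, prev_tag):
    """Features for word i; the previous decision is one of them."""
    return {
        "word": words[i],
        "shape": shape(words[i]),
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "prev_tag": prev_tag,
        "first_in_sentence": i == 0,
    }


def toy_classifier(f):
    """Hand-written stand-in for a trained model (an assumption, not OpenNLP)."""
    if f["shape"] == "INITCAP" and not f["first_in_sentence"]:
        # Continue a name if the previous word was in one, else start one.
        return "IN" if f["prev_tag"] in ("START", "IN") else "START"
    return "OTHER"


def tag_sentence(words):
    """Walk left to right, feeding each decision into the next one."""
    tags, prev_tag = [], "OTHER"
    for i in range(len(words)):
        prev_tag = toy_classifier(features(words, i, prev_tag))
        tags.append(prev_tag)
    return tags


print(tag_sentence("Did John eat the stew".split()))
print(tag_sentence("Did John Smith eat the stew".split()))
```

The point of the sketch is the prev_tag and first_in_sentence features: both only make sense relative to a sentence boundary, which is why the input is expected to be split into sentences.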




Re: Is sentence detection process really needed?

Posted by Damiano Porta <da...@gmail.com>.
Hmm, why?
If I use the POS tagger on:
"My name is Damiano. My surname is Porta"

or on the separate sentences:

My name is Damiano.
My surname is Porta.

I think the tags will be the same, no?


Re: Is sentence detection process really needed?

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
If you want to use the part of speech (from the POSTaggerME) as a feature, you will need sentences.


Re: Is sentence detection process really needed?

Posted by Damiano Porta <da...@gmail.com>.
Thanks Joern!
If I have understood you correctly:
if I do not need relations between sentences, I can skip sentence
detection, right?


Re: Is sentence detection process really needed?

Posted by Joern Kottmann <ko...@gmail.com>.
The name finder has the concept of "adaptive data" in the feature
generation. The feature generators can remember things from previous
sentences and use them to generate features. Usually that can
help with the recognition rate if you have names that are repeated.  You
can tweak this to your data, or just pass in the entire document.

Jörn
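
A rough sketch of what such an "adaptive" feature could look like, in Python. The PreviousNameFeature class and its method names are hypothetical and only illustrate the idea of remembering earlier decisions within one document; they are not OpenNLP's feature generators. In OpenNLP itself the reset step is, as far as I know, NameFinderME.clearAdaptiveData(), called between documents.

```python
class PreviousNameFeature:
    """Remembers tokens already tagged as names in the current document."""

    def __init__(self):
        self.seen_names = set()

    def features(self, tokens):
        # One feature per token: was this token tagged as a name earlier?
        return [{"seen_as_name_before": t in self.seen_names} for t in tokens]

    def update(self, tokens, tags):
        # Call after each sentence has been tagged.
        for token, tag in zip(tokens, tags):
            if tag != "OTHER":
                self.seen_names.add(token)

    def clear(self):
        # Call at every document boundary so names do not leak across resumes.
        self.seen_names.clear()


adaptive = PreviousNameFeature()
first = "Mr. Porta wrote this resume".split()
adaptive.update(first, ["OTHER", "START", "OTHER", "OTHER", "OTHER"])

second = "Contact Porta by email".split()
print(adaptive.features(second))   # "Porta" is flagged as seen before
adaptive.clear()
print(adaptive.features(second))   # nothing is flagged after the reset
```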


Re: Is sentence detection process really needed?

Posted by Damiano Porta <da...@gmail.com>.
Hi!
Yes, I can train a good model (sure, it will take a lot of time); I have 30k
resumes, so the "data" isn't a problem.
I have thought about many things. I am also creating a custom feature
generator, with a dictionary (for names) and a regex for birthdays; then
the machine learning will look at their contexts.
So now I need to separate the sentences to create a custom model.
At this point I will not try the one-CV-per-line approach.


Re: Is sentence detection process really needed?

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
Hi Damiano,
   I am not sure that the NameFinder will be effective as-is for you.  Do you have training data (and I mean a lot of training data)?  You need to consider which features are useful in your case.  You might consider a feature such as the line number on the page (since people tend to put their name on the top or second line), or maybe the font size.  You can add a dictionary of common names and have a feature “inDictionary”.  You will have to use your domain knowledge to help you here.
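
The features mentioned above could be computed roughly as follows. This is only a minimal Python sketch, not OpenNLP code: the names `COMMON_NAMES`, `token_features` and `resume_features` are made up for illustration, the dictionary is a tiny stand-in, and font size is left out because plain text does not carry it.

```python
# Hypothetical stand-in for a real dictionary of common first names.
COMMON_NAMES = {"john", "daniel", "damiano"}


def token_features(token, line_number):
    """Return a list of string features for one token on a given line."""
    features = ["line=" + str(line_number)]
    # People tend to put their name on the first or second line.
    if line_number <= 2:
        features.append("nearTop")
    # Dictionary lookup, ignoring case and trailing punctuation.
    if token.lower().strip(".,:;") in COMMON_NAMES:
        features.append("inDictionary")
    # Simple word-shape feature.
    if token[:1].isupper():
        features.append("initCap")
    return features


def resume_features(text):
    """Split a resume into lines and whitespace tokens, one feature list per token."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            rows.append((token, token_features(token, line_number)))
    return rows
```

Each (token, features) pair would then be fed to whatever classifier you train; the line number is what keeps the "position" information that is lost when the text is flattened.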

  For the birthday, you may want to consider using a regex to pick out dates.  Then look at the context around each date (the words before and after; discard it if “graduated” is nearby or if another date comes just before it), or maybe at how many years it lies before the present year (if you are looking at resumes, you probably won’t find any 5-year-olds or 200-year-olds).
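
A rough sketch of that date filter in Python is shown here. It is simplified to four-digit years only, and the function name, the age limits (15 to 100) and the three-token context window are assumptions chosen for illustration, not values from any library.

```python
import re
from datetime import date

# Four-digit years from 1800 to 2099; a deliberately simple stand-in for full date parsing.
YEAR = re.compile(r"\b(1[89]\d\d|20\d\d)\b")


def birth_year_candidates(text, today_year=None, min_age=15, max_age=100, window=3):
    """Return (year, context) pairs that could plausibly be a birth year."""
    if today_year is None:
        today_year = date.today().year
    tokens = text.split()
    candidates = []
    for i, token in enumerate(tokens):
        match = YEAR.search(token)
        if not match:
            continue
        year = int(match.group(1))
        # Age plausibility: no 5-year-olds or 200-year-olds on a resume.
        if not (min_age <= today_year - year <= max_age):
            continue
        context = [t.lower().strip(".,:;") for t in tokens[max(0, i - window):i]]
        # Skip dates that follow "graduated" or another year (e.g. a date range).
        if "graduated" in context or any(YEAR.search(t) for t in context):
            continue
        candidates.append((year, context))
    return candidates
```

The surviving candidates still need a classifier (or a rule such as "preceded by born") to pick the actual birthday; the regex only narrows the search.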


Re: Is sentence detection process really needed?

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
Hi Damiano,
Once you decide not to use the NameFinder, there are many ways to attack this problem.  I am taking a shot in the dark at features; you will need to think about this.  Maybe for every token (separated by whitespace) you create a feature isDate, plus maybe the previous 2 tokens and the next 2 tokens.  Then train the classifier.  Of course this will be a little harder because you have to write all the code to read in the training data, which may not be trivial.  Maybe your training examples look like

<START:name>John Doe<END:name> 14 Maple Tree Court, Anytown MD 20000, born: <START:birthday>Jan 1, 1850<END:birthday> school: Sept 1861-June 1864 College experience 1901-present NIH-CIT  1864-1901 US Postal Service birthday

You might see that birthdays often follow the word “born”, but the training needs to find that.
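
Reading training lines in the ad hoc format shown above could look something like this. It is a minimal Python sketch for that `<START:type>...<END:type>` markup only, not OpenNLP's own training-data reader, and `parse_annotated` is a hypothetical helper name.

```python
import re

# Matches <START:type>text<END:type>, requiring the same type in both tags.
SPAN = re.compile(r"<START:(\w+)>(.*?)<END:\1>")


def parse_annotated(line):
    """Turn one annotated line into parallel lists of tokens and labels."""
    tokens, labels = [], []
    pos = 0
    for match in SPAN.finditer(line):
        # Text before the annotated span is labelled "other".
        for token in line[pos:match.start()].split():
            tokens.append(token)
            labels.append("other")
        # Every token inside the span gets the span's type as its label.
        for token in match.group(2).split():
            tokens.append(token)
            labels.append(match.group(1))
        pos = match.end()
    # Remaining text after the last span.
    for token in line[pos:].split():
        tokens.append(token)
        labels.append("other")
    return tokens, labels
```

The token and label lists line up one to one, so they can go straight into a per-token classifier.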

Don’t forget that if you have structured data, you can use it to help the classification at any step.  For instance, if you already know the last name of the person, you can add a feature that checks whether a token is the last name.  So in the example, if you know “Doe” is the last name, you could have a feature that checks whether the next word is “Doe”.
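
The per-token features described in this message (isDate, the previous 2 and next 2 tokens, and a check against a known last name) might be sketched like this in Python. The names `DATE_LIKE` and `window_features` and the very small date pattern are assumptions for illustration only.

```python
import re

# Deliberately small date pattern: a 4-digit year or a d/m/y style date.
DATE_LIKE = re.compile(r"^\d{4}$|^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")


def window_features(tokens, i, known_last_name=None):
    """Build a feature dict for tokens[i] from its neighbours."""

    def at(j):
        # Lower-cased token at position j, or a padding marker past the edges.
        return tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"

    features = {
        "w": at(i),
        "prev1": at(i - 1),
        "prev2": at(i - 2),
        "next1": at(i + 1),
        "next2": at(i + 2),
        "isDate": bool(DATE_LIKE.match(tokens[i].strip(".,;:"))),
    }
    # Structured data, if available: is the next token the known last name?
    if known_last_name is not None:
        features["nextIsLastName"] = at(i + 1) == known_last_name.lower()
    return features
```

With “Doe” known as the last name, the token “John” gets `nextIsLastName` set to true, which is a strong hint that it is the first name.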

Daniel




Re: Is sentence detection process really needed?

Posted by Damiano Porta <da...@gmail.com>.
Hi Daniel!

Thank you so much for your opinion.
It makes perfect sense, but I am still a bit confused about the length of
the sentences.
In a resume there are many names, dates, etc., so my doubt is about
the structure of the sentences, because they sometimes follow specific
patterns.

For example, I need to extract the personal name (of the person who wrote
the resume), the birthday, etc.

As you know, there are many names and dates inside a resume, so I thought
about writing the entire resume as one sentence, to also train the
approximate "position" of the entities. If I "decompose" the whole resume
into sentences I will lose this information, no?

Damiano


Re: Is sentence detection process really needed?

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
Hi Damiano,

     Everyone can feel free to correct my ignorance, but I view the name finder as follows.

     I look at it as walking down the sentence and classifying words as “NOT IN NAME” until I hit the start of a name, then it is “START NAME”, followed by “STILL IN NAME” until “NOT IN NAME” again.  Take the sentence “Did John eat the stew”.  Starting with the first word in the sentence, decide what the odds are that it starts a person’s name (given that it is the first word in the sentence and happens to be “Did”, with a capital but not all caps).  Then go to the next word in the sentence.  If the first word was not in a name, what are the odds that the second word starts a name (given that the previous word did not start a name, the word starts with a capital (but is not all capitals), the word is “John”, and the previous word is “Did”)?  If it decides that we are starting a name at “John”, we are now looking for the end.  What are the odds that “eat” is part of the name, given that [“Did”: was not part of the name, was capitalized] and that [“John”: was the first word in the name, was capitalized]?  You are essentially classifying [Did <- OTHER] [John <- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER].  If it was “Did John Smith eat the stew”, you would have [Did <- OTHER] [John <- START] [Smith <- IN] [eat <- OTHER] [the <- OTHER] [stew <- OTHER].  There are features other than just the word, the previous word, and the shape (first letter capitalized, all letters capitalized).  I think the name finder uses part of speech also.


    So you see that it is not a name lookup table, but depends on the classification of words earlier in the sentence.  Therefore, you must have sentences.  Does that help?
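
To make the walk down the sentence concrete, here is a toy Python sketch of that left-to-right tagging. It is not how OpenNLP is implemented: `toy_classify` is a hand-written stand-in for a trained model that would output odds, and all function and feature names are made up for illustration. The point is only that each decision takes the previous label as an input.

```python
def features(tokens, i, prev_label):
    """Features for tokens[i], including the label chosen for the previous word."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "init_cap": word[:1].isupper(),
        "all_caps": word.isupper(),
        "first_in_sentence": i == 0,
        "prev_label": prev_label,
    }


def toy_classify(f):
    """Hand-written stand-in for a trained classifier (a real one scores each label)."""
    name_like = f["init_cap"] and not f["all_caps"] and not f["first_in_sentence"]
    if not name_like:
        return "OTHER"
    # Whether this word starts or continues a name depends on the previous decision.
    return "IN" if f["prev_label"] in ("START", "IN") else "START"


def tag(tokens):
    """Walk down the sentence, feeding each decision into the next one."""
    labels = []
    prev_label = "OTHER"
    for i in range(len(tokens)):
        prev_label = toy_classify(features(tokens, i, prev_label))
        labels.append(prev_label)
    return labels


print(tag("Did John eat the stew".split()))
# ['OTHER', 'START', 'OTHER', 'OTHER', 'OTHER']
print(tag("Did John Smith eat the stew".split()))
# ['OTHER', 'START', 'IN', 'OTHER', 'OTHER', 'OTHER']
```

Because `first_in_sentence` and `prev_label` are reset at each sentence boundary, feeding a whole resume as one "sentence" changes what those features mean for almost every token.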
Daniel

